PREFACE TO THE
REVISED EDITION 


The purpose of this second edition is simple. It is to make readily accessible 
some newer meta-analytic procedures that have become available since the 
publication of the first edition. These newer procedures, along with the basic
procedures also described, will make it possible for readers to conduct their
own meta-analyses and to evaluate more wisely the meta-analyses conducted 
by others. 

In the years since publication of the first edition Donald B. Rubin has 
continued to tutor me in matters meta-analytic and otherwise quantitative. 
When we collaborate, and we frequently do, he does all the work that is hard 
(for me, not for him) and original. Then he insists we publish alphabetically. 
What a country! 

—Robert Rosenthal 








PREFACE TO THE 
FIRST EDITION 


My interest in meta-analysis grew out of a research result I couldn’t quite 
believe. For reasons chronicled elsewhere (Rosenthal, 1985b), I conducted 
some studies to investigate the effect of psychological experimenters’
expectations on the responses obtained from their research subjects. These studies
suggested that experimenters’ expectations might indeed affect the results of 
their research. In the late 1950s that result was not very plausible — not to me 
and not to my colleagues. 

A long series of replications followed, which eventually persuaded me that
there must be something to the phenomenon of interpersonal expectations. 
Since the early 1960s I have been combining and comparing the results of 
series of research studies dealing with experimenters’ and others’
expectations (Rosenthal, 1961, 1963). The basic quantitative procedures for
combining and comparing research results were available even then (Mosteller
& Bush, 1954; Snedecor, 1946). 

In the mid-1960s I began teaching a variety of meta-analytic procedures 
in courses on research methods, though they were not then called
meta-analytic. Neither this teaching nor my writing employing meta-analytic
procedures seemed to have much effect on the probability of others’
employing these procedures. What did have an effect on others’ employing
meta-analytic procedures was an absolutely brilliant paper by Gene V Glass
(1976a, 1976b). In this paper, Glass named the summarizing enterprise as
“meta-analysis” and gave an elegant example of a way of doing a
meta-analysis. In the process, he persuaded me, a former psychotherapist, that
I probably had helped those patients I’d thought I’d helped. 

Since this early work by Glass, and the subsequent work with his
colleagues frequently cited in this book, there has been an extraordinary rate of
production of meta-analytic research. Since the late 1970s, there have been 
hundreds of published and unpublished meta-analyses. 

The table of contents and the introductory chapter tell in detail what is in 
this book. Its purpose, very briefly, is to describe meta-analytic procedures 
in sufficient detail so that they can be carried out by readers of this book and 
so that they can be wisely evaluated when they have been carried out by
others. 

The book was designed to be used by advanced undergraduate students, 
graduate students, and researchers in the social and behavioral sciences. The 
level of mathematical sophistication required is high school algebra. The 
level of statistical sophistication required is about half-way through a second 
course in data analysis (e.g., Rosenthal and Rosnow, 1984a; 1991). 

I am doubly grateful to the National Science Foundation: first, for having 
frequently supported, since 1961, the substantive research on interpersonal 
expectations, the research that gave me something to meta-analyze; and
second, for having supported in part the development of some of the
methodological procedures to be described in this book.

Let me also thank especially these people: Frederick Mosteller for having 
markedly enlarged my horizons about meta-analytic procedures some 20 
years ago; Jacob Cohen, a fine colleague I have never met, but whose writings 
about power and effect size estimation have influenced me profoundly; and 
Donald B. Rubin, a frequent collaborator and my long-standing tutor on
matters meta-analytic and otherwise quantitative. I have described our
collaboration to students as follows: “I ask him questions and he answers them.”
Clearly, the ideal collaboration!

The manuscript was improved greatly by the suggestions of Len Bickman, 
Debra Rog, Harris Cooper, and an anonymous reviewer, and it was superbly 
typed by Blair Boudreau, whose legendary accuracy ruins one’s skill at, and 
motivation for, proofreading. 

Finally, I thank MaryLu Rosenthal for what she taught me about
bibliographic retrieval in the social sciences (M. Rosenthal, 1985) and for the
countless ways in which she improved the book and its author. 


-R.R. 


A NOTE ON THE 
REVISED EDITION 


This revision makes readily accessible some newer meta-analytic procedures 
that have been developed since the 1984 edition. A new effect size indicator,
π, for one-sample data is introduced, as is a new coefficient of robustness of
replication. Procedures for combining and comparing effect sizes for
multiple dependent variables are described, and new data are reported on the
magnitude of the problem of incomplete retrieval (the file drawer problem). 
Finally, new results are provided on the social, psychological, economic, and 
medical importance of small effect sizes. 
Introduction


Two sources of pessimism in the social sciences are discussed. Early examples of
meta-analytic procedures are given that illustrate (1) summarizing relationships,
(2) determining moderator variables, and (3) establishing relationships by aggregate analysis. The
current status of meta-analytic procedures is described and an empirical evaluation of the 
employment of meta-analytic procedures is provided. 


There is a chronic pessimistic feeling in the social and behavioral sciences 
that, when compared to the natural sciences, our progress has been
exceedingly slow, if indeed there has been any progress at all. From time to time
this chronic state erupts into an acute condition, or crisis, precipitated in
part by “local” (i.e., disciplinary) developments. For example, in the
discipline of social psychology, the precipitating factors leading to prolonged
crisis have been brilliantly analyzed by Ralph Rosnow (1981) in his book,
Paradigms in Transition. It seems a good bet, however, that had we been
doing better as a science on a chronic basis, our acute crisis would have been 
less severe. Two general purposes of this book are to describe quantitative 
procedures that will show (1) how we can “do better” than we have been 
doing, and (2) how we have, in fact, been “doing better” than we think we 
have been doing. 


I. TWO SOURCES OF PESSIMISM
IN THE SOCIAL SCIENCES 1 


I.A. Poor Cumulation

One of the two sources of pessimism in the social sciences, the one which 
is the focus of this book, is the problem of poor cumulation. This problem 
refers to the observation that the social sciences do not show the orderly 
progress and development shown by such older sciences as physics and
chemistry. The newer work of the physical sciences builds directly upon the older
work of those sciences. The social sciences, on the other hand, seem almost to 
be starting anew with each succeeding volume of our scientific journals. 

While it appears that the natural and physical sciences have problems of 
their own when it comes to successful cumulation (Collins, 1985; Hedges, 
1987; Hively, 1989; Koshland, 1989; Mann, 1990; Pool, 1988, 1989; Taubes,
1990), there is no denying that in the matter of cumulating evidence we have 
much to be modest about. Poor cumulation does not seem to be due primarily 
to lack of replication or failure to recognize the need for replication. Indeed, 
the calls for further research with which we so frequently end our articles are 
carried wherever our scholarly journals are read. It seems rather that we have 
been better at issuing such calls than at knowing what to do with the answers. 
There are many areas of the social sciences for which we do have the results 
of many studies all addressing essentially the same question. Our summaries 
of the results of these sets of studies, however, have not been nearly as 
informative as they might have been, either with respect to summarized 
significance levels or with respect to summarized effect sizes. Even the best 
reviews of research by the most sophisticated workers have rarely told us 
more about each study in a set of studies than the direction of the relationship 
between the variables investigated and whether or not a given p level was 
attained. This state of affairs is beginning to change. More and more reviews 
of the literature are moving from the traditional literary format to the 
quantitative format (for overviews see Cooper, 1984, 1989b; Glass, McGaw,
& Smith, 1981; Hedges & Olkin, 1985; Hunter & Schmidt, 1990; Hunter,
Schmidt, & Jackson, 1982; Light & Pillemer, 1984; Mullen, 1989; Mullen &
Rosenthal, 1985; Rosenthal, 1980, 1984).

Three more specific purposes of this book relevant to the problem of 
poor cumulation include the following: 

(1) Defining the concept of a study’s “results” more clearly than is our custom in
the social sciences.

(2) Providing a general framework for conceptualizing meta-analysis, i.e., the
quantitative summary of research domains.

(3) Illustrating the quantitative procedures within this framework so they can be
applied by the reader and/or understood more clearly when applied by others.

I.B. Small Effects 

The second source of pessimism in the social sciences on which we focus 
in this book is the problem of small effects. Even when we do seem to come
up with a possibly replicable result, the practical magnitude of the effect is 
almost always small, i.e., accounts for only a trivial proportion of the variance. 
Thus, the complaint goes, even if some social action program works, or if
some new teaching method works, or if psychotherapy works, the size of the 
effect is likely to be so small that it is of no practical consequence whatever. 

One specific purpose of this book is to describe a procedure for helping 
us to evaluate the social importance of the effects of any independent
variable. This is done in detail in the final chapter.

II. EARLY EXAMPLES OF 
META-ANALYTIC PROCEDURES 

Early applications of meta-analytic procedures were of three types. The 
first type was that in which the goal was to summarize for a set of studies what 
the overall relationship was between two variables that had been investigated 
in each study. Often this goal was approached by trying to estimate the
average relationship between two variables found in a set of studies. Often this
goal was approached by significance testing, i.e., by trying to determine the 
probability that the relationship obtained could have been obtained if, in the 
population from which the studies had been sampled, the true relationship 
were zero. 

The second type of early application of meta-analytic procedures was not 
so much concerned with summarizing the relationship between two
variables, but with determining the factors that were associated with variations
in the magnitude of relationships between the two variables (i.e., the factors 
that served as moderator variables). 

The third type of early application did not examine any relationship 
within each study. Instead, each study provided only aggregated data for 
each variable, for instance, the average attitude held by the participants in a 
study or their average level of cognitive performance. These aggregated or 
averaged data were then correlated with each other or with other
characteristics of the study to test hypotheses or to suggest hypotheses to be tested in
subsequent specifically designed studies. 

To summarize the differences among these three types of early application 
of meta-analytic procedures we can say that: (1) the first type generally
resulted in an estimate of the average correlation (or the combined p level
associated with that correlation) found in all the studies summarized; (2) the
second generally resulted in a correlation between some characteristic of the
studies and the correlation (or other index of the size of the effect) found in 
the studies; and (3) the third simply correlated mean data obtained from each 
study with other mean data or with other characteristics obtained from each 
study. We turn now to some examples of these three types of early application. 
II.A. Summarizing Relationships

One of our early examples is drawn not from social but from agricultural 
science. Jay Lush (1931) investigated the relationship between the initial 
weight of steers and their subsequent gain in weight. Lush had six samples 
of steers available and he was interested in computing the average of the six 
correlations he had available (median r = .39). 

What made these six samples of steers famous was not Lush’s averaging 
of correlations, but George Snedecor’s (1946) putting the six correlations 
into his classic textbook of statistics as an example of how to combine
correlation coefficients. Subsequent editions have retained that famous example
(e.g., Snedecor & Cochran, 1980, 1989). Snedecor’s long-time coauthor
William G. Cochran had himself been a pioneer in the field of meta-analysis.
Early on he had addressed himself to the statistical issues involved in
comparing and combining the results of series of studies (Cochran, 1937, 1943).

In his textbook example, Snedecor (1946) did much more than show how 
to combine estimates of magnitudes of relationships (r’s). He also showed 
how to assess the heterogeneity of a set of correlation coefficients. That is, 
he showed how a χ² test could be employed to help us judge whether,
overall, the correlations differed significantly from each other.
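
The two steps Snedecor illustrated can be sketched in a few lines of Python. This is a minimal sketch, assuming the usual approach of transforming each r to Fisher's z, weighting by n - 3, and referring the weighted sum of squared deviations to a χ² distribution with k - 1 degrees of freedom. The correlations and sample sizes below are hypothetical (they are not Lush's data), and the function names are introduced here only for illustration.

```python
import math

def chi2_sf(x, df):
    """Upper-tail probability of a chi-square variate with integer df."""
    if x <= 0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        q, a = math.erfc(math.sqrt(half)), 0.5   # Q(1/2, x/2)
    else:
        q, a = math.exp(-half), 1.0              # Q(1, x/2)
    # Raise the shape parameter to df/2 using
    # Q(a + 1, h) = Q(a, h) + h**a * exp(-h) / Gamma(a + 1)
    while a < df / 2.0 - 1e-9:
        q += half ** a * math.exp(-half) / math.gamma(a + 1.0)
        a += 1.0
    return q

def combine_correlations(rs, ns):
    """Return (weighted mean r, chi-square, df, p) for k correlations."""
    zs = [math.atanh(r) for r in rs]          # Fisher z transformation
    ws = [n - 3 for n in ns]                  # weight = n - 3
    z_bar = sum(w * z for w, z in zip(ws, zs)) / sum(ws)
    chi2 = sum(w * (z - z_bar) ** 2 for w, z in zip(ws, zs))
    df = len(rs) - 1
    return math.tanh(z_bar), chi2, df, chi2_sf(chi2, df)

# Hypothetical correlations and sample sizes for six samples
rs = [0.25, 0.20, 0.30, 0.35, 0.15, 0.10]
ns = [40, 55, 30, 60, 45, 50]

mean_r, chi2, df, p = combine_correlations(rs, ns)
print(f"mean r = {mean_r:.3f}, chi-square({df}) = {chi2:.2f}, p = {p:.3f}")
```

A large p here would suggest that the correlations do not differ from one another by more than sampling error would lead us to expect; a small p would suggest heterogeneity worth explaining.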

Moving from an agricultural to a social science example, my own early
meta-analytic research was also concerned with estimating average
correlations. In one summary of what was then known about the effects of
experimenters’ expectancies on the results of their research, the average
correlations (based on a number of studies each based on a number of
experimenters) were reported between experimenters’ expectancies for their
subjects’ performance and how their subjects subsequently did perform
(Rosenthal, 1961, 1963). These average correlations were computed separately
for experimenters who were explicitly encouraged to bias their results
(median r = -.21) and those who were not (median r = .43). A test (contrast)
was then performed to help judge whether these average correlations
differed significantly from each other. They did differ, suggesting that, while
under ordinary circumstances experimenters tended to get the results they
expected to get, they tended to get significantly opposite results when they
felt unduly influenced (or even bribed) to bias the results of their research
(Rosenthal, 1961, 1963). Analogous analyses were performed on a series of
studies investigating the relationship between experimenters’ personality
and the extent to which they obtained data affected by their expectancy
(Rosenthal, 1961, 1963, 1964).

Snedecor’s textbook example of testing for the heterogeneity of a set of
correlation coefficients was also applied to the study of experimenter
effects. In one such analysis, eight studies could be found in which
experimenters had served as subjects in the same task they were now
administering to others in their role of experimenter. The correlations could therefore
be obtained between the performance of experimenters at a given task and
the average performance of those experimenters’ subjects on the same task.
The application of Snedecor’s test showed the eight correlation coefficients
to be significantly heterogeneous (Rosenthal, 1961, 1963).

Snedecor’s textbook example illustrated both the computation of average 
r’s and a test for heterogeneity of r’s. What his example did not illustrate was 
an overall test of significance to help us judge the probability that the
particular set of r’s with their associated tests of significance could have been
obtained if the true value of r in the appropriate population were zero. Had
Snedecor wanted to, he could readily have illustrated the process of
combining probability levels. At least two major figures in the history of
mathematical statistics, Ronald Fisher (e.g., 1932, 1938) and Karl Pearson (1933a,
1933b) had already described procedures for combining probabilities. Even 
earlier, Tippett (1931) had described a related procedure that did not exactly 
combine probabilities but “protected” the smallest obtained p by multiplying 
it by the number of tests of significance examined. 
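
As a hedged illustration of these early procedures, the Python sketch below combines a set of independent p levels in the manner usually attributed to Fisher, taking -2 times the sum of the natural logs of the p's as a χ² with 2k degrees of freedom, and also computes the "protected" smallest p described above for Tippett's approach. The p levels are hypothetical, and the function names are assumptions made for this example only.

```python
import math

def fisher_combined_p(ps):
    """Return (chi-square, df, combined p) for independent p levels."""
    k = len(ps)
    half = -sum(math.log(p) for p in ps)     # chi-square / 2
    # With an even df of 2k the upper tail has a closed form:
    # exp(-h) * sum_{i=0}^{k-1} h**i / i!
    term, total = 1.0, 1.0
    for i in range(1, k):
        term *= half / i
        total += term
    return 2.0 * half, 2 * k, math.exp(-half) * total

def protected_smallest_p(ps):
    """Smallest p multiplied by the number of tests, capped at 1."""
    return min(1.0, len(ps) * min(ps))

# Hypothetical one-tailed p levels from three independent studies
ps = [0.10, 0.10, 0.10]

chi2, df, p_fisher = fisher_combined_p(ps)
print(f"Fisher: chi-square({df}) = {chi2:.2f}, p = {p_fisher:.4f}")
print(f"Protected smallest p = {protected_smallest_p(ps):.2f}")
```

In this hypothetical case no single study reaches the .05 level, yet the combined p is about .03, while the protected smallest p of .30 shows how conservative the second procedure can be when several studies point the same way.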

Mosteller and Bush (1954) broadened the Fisher and Pearson
perspectives and made several methods of combining independent probabilities
available to social scientists in general and to social psychologists in
particular. An early and ingenious application of a method of combining
probabilities was described by Stouffer and his colleagues (1949). For three samples
of male soldiers, data were available on the favorability of their view of 
women soldiers as a function of the presence of women soldiers at their own 
camp. Male soldiers tended to be more unfavorable to women soldiers (as 
defined by not wanting their sisters to join the Army) when women soldiers 
were at their camp. 

Returning to our examples of studies of experimenter expectancy effects,
we find illustrations of the application of the Stouffer method of combining
probabilities. After the first three experiments showed the effects of
experimentally created expectations on the results of their research, the three
probability levels obtained were combined to give an overall test of
significance for the set of three studies (Rosenthal, 1966).
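
A minimal Python sketch of the Stouffer method follows, assuming its usual form: each one-tailed p is converted to a standard normal deviate Z, the Z's are summed, and the sum is divided by the square root of the number of studies. The three p levels are hypothetical and are not those of the experiments cited above; the function name is introduced only for illustration.

```python
from math import sqrt
from statistics import NormalDist

def stouffer_combined(ps):
    """Return (combined Z, one-tailed combined p) for independent p levels."""
    nd = NormalDist()                          # standard normal distribution
    zs = [nd.inv_cdf(1.0 - p) for p in ps]     # one-tailed p -> Z
    z = sum(zs) / sqrt(len(zs))
    return z, 1.0 - nd.cdf(z)

# Hypothetical one-tailed p levels from three studies
ps = [0.05, 0.05, 0.05]

z, p = stouffer_combined(ps)
print(f"combined Z = {z:.3f}, combined p = {p:.4f}")
```

Three results that each just reach p = .05 yield a combined Z of about 2.85, so the set as a whole is far less likely under the null hypothesis than any one study taken alone.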

II.B. Determining Moderator Variables

In this section we describe an early application of meta-analytic
procedures in which the goal was not to establish an overall relationship between
two variables, but to determine the factors that were associated with
variations in the magnitudes of the relationships between two variables. Such
factors are known as moderator variables because they moderate or alter 
the magnitude of a relationship. 

An early application was by Thorndike (1933). He obtained the results of
36 independent studies of the retest reliability of the Binet test of
intelligence. Thorndike was not interested in an overall estimate of the retest
reliability per se, but in how the magnitude of the retest reliability correlation
varied as a function of the time interval between the first and second testing.
As might be expected, the greater the interval, the lower the retest
reliability. These intervals ranged from less than one month to 60 months with a
median interval of about two years.

Thorndike did not report an overall estimate of the retest reliability 
(r = .84) or the correlation between the magnitude of the retest reliability
and the retest time interval (r(34) = -.39). He did report the estimated
reliabilities separately for various retest intervals; e.g., less than one month 
(r = .95) to nearly five years (r = .81). (The average reliabilities reported 
here are not those reported by Thorndike but are corrected estimates.) 

A somewhat later example of the use of moderator variables is drawn 
from the research program on experimenter effects mentioned earlier. Eight
studies had been summarized in each of which the performance of
experimenters at a given task could be correlated with the average performance of
those experimenters’ subjects on the same task. Rosenthal (1963, 1964) was
interested in learning the degree to which these correlations changed from
the earlier to the later studies. He found a significant and substantial
(r = .81) effect of when the study was done; studies conducted earlier
obtained significantly more positive correlations (relative to later-conducted
studies) while later-conducted studies obtained more negative correlations 
(relative to earlier-conducted studies). 

II.C. Establishing Relationships by Aggregate Analysis 

In this section we describe an early application of the meta-analytic 
procedure wherein each study provides only aggregated (average) data for 
each variable. 

Underwood (1957) was interested in the relationship between the degree 
of retention of various kinds of learned materials (e.g., geometric forms, 
nonsense syllables, nouns) and the number of previous lists of materials that 
had been learned. He hypothesized that the more lists that had been learned 
before the recall tests, the greater would be the forgetting. Underwood 
found 14 studies that each yielded both required facts: percentage recalled 
after 24 hours and the average number of previously learned lists. His
hypothesis was strongly supported by the data. The correlation between these
two variables (r based on ranks) was dramatically large: r(12) = -.91! (We
would not expect to find correlations that large within the individual studies
contributing to the aggregate analysis, however, since it is characteristic of
aggregate analyses to yield larger correlations.)


TABLE 1.1

Illustration of Differences in Summarizing Relationships, Determining
Moderator Variables, and Establishing Relationships

                 A                        B                C
       Correlations Between          Mean Ratings      Mean Level
       Teacher Expectations and      of Teacher        of Pupil
       Pupil Performance (r)         Excellence        Performance

                .25                       8               115
                .20                       9               110
                .30                       9               105
                .35                       7               105
                .15                       6               100
                .10                       5                95

Mean            .22                      7.3             105.0

NOTES: The mean of column A illustrates the summarizing function; the correlation between
columns A and B (r = .59) (and A and C; r = .53) illustrates the examination of moderator variables;
and the correlation between columns B and C illustrates the attempt to establish a relationship
(r = .78). This table represents only a hypothetical example.


II.D. Summarizing, Moderating, and Establishing
Relationships Meta-Analytically 

As a review of the differences among the early meta-analytic procedures
designed for the three different purposes we have illustrated so far, Table
1.1 has been prepared, a hypothetical example illustrating differences in
summarizing relationships, determining moderator variables, and
establishing relationships. Column A shows the results of six studies of teacher
expectancy effects expressed for each study as the correlation between teacher
expectations and pupil performance. Column B shows the mean rating of the 
excellence of the teachers employed in each of the six studies. Column C 
shows the mean level of pupil performance found for all the children of each 
of the six studies. 

The summarizing function of meta-analysis is illustrated by the mean of 
Column A, i.e., the mean magnitude of the relationship between teacher 
expectations and pupil performance. 

The determination of moderator variables is illustrated by the
correlation of the data in Column A and the data in Column B. That correlation
(r = .59) shows that larger effects of teacher expectations are associated
with teachers who have on the average been judged to be more excellent.
Another illustration of moderating effects is found in the correlation
between Column A and Column C. That correlation (r = .53) shows that
larger effects of teacher expectations are associated with pupils who have on
the average shown higher levels of performance.

The attempt to establish a relation by aggregate analysis is illustrated by 
the correlation of Columns B and C. That correlation (r = .78) shows that 
higher levels of mean pupil performance are associated with higher levels of 
rated teacher excellence. It might be tempting to interpret this correlation to 
mean that better teachers produce higher levels of pupil performance but that 
cannot properly be inferred from the correlation obtained. It would take a 
freshly designed study to establish properly the causal factors, if any,
contributing to the obtained correlation. Similarly, the correlations describing the
operation of moderator variables cannot be interpreted causally in most cases
since we did not randomly assign studies to the various levels of the
moderator variables. Cooper (1984, 1989b) has made this point clearly and forcefully
in his book in this series. Causal inferences, however, can be made about the
results of the studies being summarized if these results are based on
experiments involving random assignment of subjects to treatment conditions.
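
The three quantities drawn from Table 1.1 can be checked directly. The Python sketch below enters the six hypothetical studies from the table and computes the mean of Column A along with the three product-moment correlations; the variable and function names are introduced here for illustration only.

```python
from math import sqrt

def pearson_r(x, y):
    """Product-moment correlation between two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

# Table 1.1 (hypothetical data): one entry per study
col_a = [0.25, 0.20, 0.30, 0.35, 0.15, 0.10]   # r, expectations with performance
col_b = [8, 9, 9, 7, 6, 5]                      # mean rating of teacher excellence
col_c = [115, 110, 105, 105, 100, 95]           # mean level of pupil performance

mean_a = sum(col_a) / len(col_a)   # summarizing function
r_ab = pearson_r(col_a, col_b)     # moderator: teacher excellence
r_ac = pearson_r(col_a, col_c)     # moderator: pupil performance level
r_bc = pearson_r(col_b, col_c)     # aggregate analysis

print(f"mean of A = {mean_a:.3f}")
print(f"r(A, B) = {r_ab:.2f}, r(A, C) = {r_ac:.2f}, r(B, C) = {r_bc:.2f}")
```

The correlations round to .59, .53, and .78, matching the values given in the table notes; the mean of Column A is .225, which the table reports as .22.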

III. THE CURRENT STATUS OF 
META-ANALYTIC PROCEDURES 

We have now examined several early examples of meta-analytic
procedures, some going back over half a century. Although several of the
procedures have been available for many years (the present writer has been
employing some for about 30 years), there has been no revolution in how we
conduct reviews of the literature or summarize domains of research. That is,
most reviews of the literature still follow a more traditional narrative style.
However, there may be a revolution in the making. As evidence, consider
that in their analysis of the number of publications on meta-analysis from
the years 1976 to 1982, Lamb and Whitla (1983) found a strong linear
increase (r = .85) from the 6 papers of 1976 to the 120 papers of 1982. Since
that time the rapid increase in the employment of meta-analytic ideas and 
meta-analytic procedures is continuing (Hunter & Schmidt, 1990). 

The work that probably did the most to capture the imagination of the 
social sciences as to the value of meta-analytic procedures was the brilliant 
meta-analytic work of Gene Glass and his collaborators. Specifically, Glass 
and his colleagues, employing meta-analytic procedures very similar to 
those of the present writer (Rosenthal, 1961, 1963, 1969, 1976; Rosenthal 
& Rosnow, 1975) but developed independently, were able to demonstrate 
dramatically the effectiveness of psychotherapy (Glass, 1976a, 1976b, 
1977; Smith & Glass, 1977; Smith, Glass, & Miller, 1980). Partly because 
of the work of Glass and his group, the last few years have shown a rapidly 
growing number of investigators who have been discussing, employing, and
developing a variety of meta-analytic procedures. These investigators
include Bloom, 1964; Cook & Leviton, 1980; Cooper, 1979, 1982, 1984,
1989a; Cooper & Hazelrigg, 1988; Cooper & Rosenthal, 1980; DePaulo,
Zuckerman, & Rosenthal, 1980; Dusek & Joseph, 1983; Eagly & Carli, 1981;
Feldman, 1971; Fiske, 1983; Glass, 1976, 1980; Glass & Kliegl, 1983; Glass
et al., 1981; Green & Hall, 1984; Hall, 1980, 1984; Harris & Rosenthal, 1985,
1988; Hedges, 1981, 1982a, 1982b, 1982c, 1983a, 1983b; Hedges & Olkin,
1980, 1982, 1983a, 1983b, 1985; Hunter & Schmidt, 1990; Hunter et al.,
1982; Kulik, Kulik, & Cohen, 1979; Light, 1979; Light & Pillemer, 1982,
1984; Light & Smith, 1971; Mintz, 1983; Mullen, 1989; Mullen & Rosenthal,
1985; Pillemer & Light, 1980a, 1980b; Rosenthal, 1963, 1964, 1968, 1969,
1976, 1978, 1979, 1980, 1982, 1983a, 1983b, 1983c, 1985a, 1985b, 1986,
1987a, 1987b, 1990; Rosenthal & DePaulo, 1979; Rosenthal & Rosnow,
1975; Rosenthal & Rubin, 1978a, 1978b, 1979a, 1980, 1982b, 1982c, 1983,
1984, 1986, 1988, 1989, 1991; Shapiro & Shapiro, 1983; Shoham-Salomon
& Rosenthal, 1987; Smith, 1980; Smith & Glass, 1977; Smith et al., 1980;
Strube, Gardner, & Hartmann, 1985; Strube & Hartmann, 1983; Sudman &
Bradburn, 1974; Taveggia, 1974; Viana, 1980; Wachter & Straf, 1990;
Walberg & Haertel, 1980; Wilson & Rachman, 1983; Wolf, 1986;
Zuckerman, DePaulo, & Rosenthal, 1981, and the many others cited in the references
of these workers.

In the pages that lie ahead we consider in detail how to employ a variety 
of meta-analytic procedures. Our procedures are not perfect, we can use 
them inappropriately, and we will make mistakes. Nevertheless, the
alternative to the systematic, explicit, quantitative procedures to be described is
even less perfect, even more likely to be applied inappropriately, and even
more likely to lead us to error. There is nothing in the set of meta-analytic
procedures that makes us less able to engage in creative thought. All the
thoughtful and intuitive procedures of the traditional review of the
literature can also be employed in a meta-analytic review. However,
meta-analytic reviews go beyond the traditional reviews in the degree to which
they are more systematic, more explicit, more exhaustive, and more
quantitative. Because of these features, meta-analytic reviews are more likely to
lead to summary statements of greater thoroughness, greater precision, and 
greater intersubjectivity or objectivity (Kaplan, 1964). In the final chapter 
of this book we consider systematically the several criticisms that have been 
made of meta-analytic procedures and products. 
IV. AN EMPIRICAL EVALUATION
OF META-ANALYTIC PROCEDURES 

Harris Cooper and I were interested in assessing empirically the effects
of employing meta-analytic procedures on the conclusions drawn by
investigators in training (i.e., graduate students) and experienced investigators
(i.e., faculty members) (Cooper & Rosenthal, 1980). The basic idea was to
ask the participants to conduct a review of the literature to address the
question of sex differences in task persistence. Some of the participants were
randomly assigned to the meta-analytic procedure condition, and some 
were randomly assigned to the traditional procedure condition. All of the 
participants were given the same seven studies that we knew beforehand 
significantly supported overall the hypothesis that females showed greater 
task persistence. 

There was a total of 41 participants initially blocked on sex and faculty 
(versus graduate student) status. However, since neither of these variables 
affected the results of the experiment, results were reported for all 41
participants combined. Participants assigned to the meta-analytic procedure
condition were asked to record the significance level of each study and were
given detailed instructions on how to combine these significance levels to
obtain an overall test of significance for the entire set of seven studies.
Participants assigned to the traditional procedure condition were asked to
employ whatever procedures they would normally employ to conduct a review
of the literature. 

After participants completed their reviews, they were asked whether the 
evidence supported the conclusion that females were more task persistent 
than males. They could respond: definitely yes, probably yes, can’t tell, 
probably no, or definitely no. Participants were also asked to estimate the 
magnitude of the relationship between gender and persistence. To this
question they could respond: none at all, very small, small, moderate, large, and
very large. 

Despite the fact that the set of seven studies reviewed showed a clearly
significant relationship between sex and task persistence, 73% of the
traditional reviewers found probably or definitely no support for the
hypothesis compared to only 32% of the meta-analytic reviewers. That
difference (significant at p < .005) suggests that traditional methods of
reviewing may suffer a very considerable loss of power relative to
meta-analytic methods. Put another way, the incidence of type II errors
(failing to reject null hypotheses that are false) may be far greater for
the traditional than for the meta-analytic procedures of summarizing
research domains.

NOTE 

1. Throughout this book reference to the social sciences or to the behavioral sciences will 
refer to both the social and behavioral sciences. 
Defining Research Results

The concept of “research results” is clarified and the relationship between
tests of significance and estimates of effect sizes is emphasized. Various
types of effect size estimates and adjustments of these estimates are
described. Finally, methods of dealing with the problem of multiple
correlated results are described.

Much of the rest of this book will deal with quantitative procedures for
comparing and combining the results of a series of studies. Before these
procedures can be discussed meaningfully, however, we must become explicit
about what we mean when we refer to the results of an individual study.

We begin by stating what we do not mean when we refer to the results of a
study: We do not mean the conclusion drawn by the investigator, since that
is often only vaguely related to the actual results. The metamorphosis that
sometimes occurs between the results section and the discussion section is
itself a topic worthy of detailed consideration. For now it is enough to note
that a fairly ambiguous result often becomes quite smooth and rounded in
the discussion section, so that reviewers who dwell too much on the
discussion and too little on the results can be quite misled as to what
actually was found.

We also do not mean the result of an omnibus F test with df > 1 in the
numerator or an omnibus $\chi^2$ test with df > 1. In both cases we are
getting quantitative answers to questions that are often, perhaps usually,
hopelessly imprecise. Only rarely is one interested in knowing for any
fixed-factor analysis of variance or covariance that somewhere in the thicket
of df there lurk one or more meaningful answers to meaningful questions that
we had not the foresight to ask of our data. Similarly, there are few
occasions when what we really want to know is that somewhere in a contingency
table there is an obtained frequency or two that has strayed too far from the
frequency expected for that cell under the null hypothesis.

What we shall mean by the results is the answer to this question: What is
the relationship between any variable X and any variable Y? The variables
X and Y are chosen with only the constraint that their relationship be of
interest to us. The answer to this question must come in two parts: (1) the
estimate of the magnitude of the relationship (the effect size) and (2) an
indication of the accuracy or reliability of the estimated effect size (as in a
confidence interval placed around the estimate). An alternative to the
second part of the answer is one not intrinsically more useful but one more
consistent with the existing practices of social researchers, that is, the
test of significance of the difference between the obtained effect size and
the effect size expected under the null hypothesis of no relationship between
variables X and Y.

I. EFFECT SIZE AND 
STATISTICAL SIGNIFICANCE 

Since the argument has been made that the results of a study with respect
to any given relationship can be expressed as an estimate of an effect size
plus a test of significance, we should make explicit the relationship between
these two quantities. The general relationship is shown below:

$$ \text{Test of Significance} = \text{Size of Effect} \times \text{Size of Study} $$

Tables 2.1 and 2.2 give useful specific examples of this general equation.
Equation 2.1 shows that $\chi^2$ on df = 1 is the product of the size of the
effect expressed by $\phi^2$ (the squared product moment correlation)
multiplied by N (the number of subjects or other sampling units). It should
be noted that $\phi$ is merely Pearson’s r applied to dichotomous data, i.e.,
data coded as taking on only two values such as 0 and 1, 1 and 2, or +1
and -1.

Equation 2.2 is simply the square root of equation 2.1. It shows that the
standard normal deviate Z (i.e., the square root of $\chi^2$ on 1 df) is the
product of $\phi$ (the product moment correlation) and $\sqrt{N}$. Equation
2.3 shows that t is the product of the effect size $r/\sqrt{1 - r^2}$ and
$\sqrt{df}$, an index of the size of the study. The denominator of this
effect size ($\sqrt{1 - r^2}$) is also known as the coefficient of alienation
or k, an index of the degree of noncorrelation (Guilford & Fruchter, 1978).
This effect size, therefore, can be rewritten as r/k, the ratio of
correlation to noncorrelation, a kind of signal-to-noise ratio. Equations 2.4
and 2.5 share the same effect size, the difference between the means of the
two compared groups divided by, or standardized by, the unbiased estimate of
the population standard deviation.
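TABLE 2.1

Examples of the Relationship Between Tests of Significance
and Effect Size: $\chi^2(1)$, Z, and t

            Test of           Size of                  Size of
Equation    Significance  =   Effect              ×    Study

2.1         $\chi^2(1)$   =   $\phi^2$            ×    $N$
2.2         $Z$           =   $\phi$              ×    $\sqrt{N}$
2.3         $t$           =   $r/\sqrt{1 - r^2}$  ×    $\sqrt{df}$
2.4         $t$           =   $(M_1 - M_2)/S$     ×    $1/\sqrt{1/n_1 + 1/n_2}$
2.5         $t$           =   $(M_1 - M_2)/S$     ×    $\sqrt{n_1 n_2/(n_1 + n_2)}$
2.6         $t$           =   $(M_1 - M_2)/\sigma$  ×  $[\sqrt{n_1 n_2}/(n_1 + n_2)] \sqrt{df}$
2.7         $t$           =   $d$                 ×    $\sqrt{df}/2$
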
This latter effect size $(M_1 - M_2)/S$ is the one typically employed by
Glass and his colleagues (1981) with the S computed as
$[\Sigma(X - \bar{X})^2/(n_C - 1)]^{1/2}$ employing only the subjects or
other sampling units from the control group. The pooled S, that is, the one
computed from both groups, tends to provide a better estimate in the long run
of the population standard deviation. However, when the S’s based on the two
different conditions differ greatly from each other, choosing the control
group S as the standardizing quantity is a very reasonable alternative. That
is because it is always possible that the experimental treatment itself has
made the S of the experimental group too large or too small relative to the S
of the control group.

Another alternative when the S’s of the two groups differ greatly is to
transform the data to make the S’s more similar. Such transformations (e.g.,
logs, square roots, etc.) of course require our having access to the original
data, but that is also often required to compute S separately for the control
group. When only a mean square error from an analysis of variance is
available we must be content to use its square root (S) as our standardizing
denominator in any case. Or if only the results of a t test are given, we are
similarly forced to compute the effect size using a pooled estimate of S. (We
could use equations 2.4 or 2.5 to solve for $(M_1 - M_2)/S$.)

Before leaving the topic of whether to compute S only from the control
group or from both groups we should remind ourselves of the following:
When S’s differ greatly for the two groups so that we are inclined to
compute S only from the control group, ordinary t tests may give misleading
results. Such problems can be approached by approximate procedures
(Snedecor & Cochran, 1989, pp. 96-98) but are perhaps best dealt with by
appropriate transformation of the data (Tukey, 1977).

Equation 2.6 shows an effect size only slightly different from that of
equations 2.4 and 2.5. The only difference is that the standardizing quantity
for the difference between the means is $\sigma$ (computed from the pooled
sums of squares divided by N) rather than S (computed from the pooled sums of
squares divided by N - k for k groups). This is one of the effect sizes
employed by Cohen (1969, 1977, 1988) and by Friedman (1968). Basically this
index, Cohen’s d, is the difference between the means of the groups being
compared given in standard score units or z-scores. Equation 2.7 shows
$(M_1 - M_2)/\sigma$ expressed as d and the size of study term simplified
considerably for those situations in which it is known or in which it can be
reasonably assumed that the sample sizes ($n_1$ and $n_2$) are equal.

Equation 2.8 of Table 2.2 shows that F with one df in the numerator is the
product of the squared ingredients of the right hand side of equation 2.3 of
Table 2.1. That is just as it should be, of course, given that $t^2 = F$ when
df = 1 in the numerator of F.
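
The product relationship can be checked numerically. The following is a minimal sketch, assuming two small sets of hypothetical scores (the data and the function names are illustrative and not from the text): it computes t directly from the pooled S, then recovers the same t from r (equation 2.3) and from $(M_1 - M_2)/S$ (equation 2.5), and confirms that the squared quantities give F (equation 2.8).

```python
import math

def pooled_t(x1, x2):
    """Independent-samples t, pooled S, and df for two lists of scores."""
    n1, n2 = len(x1), len(x2)
    m1, m2 = sum(x1) / n1, sum(x2) / n2
    ss = sum((x - m1) ** 2 for x in x1) + sum((x - m2) ** 2 for x in x2)
    df = n1 + n2 - 2
    s = math.sqrt(ss / df)  # unbiased pooled estimate of the population SD
    t = (m1 - m2) / (s * math.sqrt(1 / n1 + 1 / n2))
    return t, s, df

def point_biserial_r(x1, x2):
    """Pearson r between the scores and group membership coded 1 and 0."""
    y = list(x1) + list(x2)
    g = [1] * len(x1) + [0] * len(x2)
    n = len(y)
    my, mg = sum(y) / n, sum(g) / n
    sxy = sum((a - my) * (b - mg) for a, b in zip(y, g))
    syy = sum((a - my) ** 2 for a in y)
    sgg = sum((b - mg) ** 2 for b in g)
    return sxy / math.sqrt(syy * sgg)

# Hypothetical scores, chosen only for illustration.
treatment = [7, 9, 6, 8, 10, 7]
control = [5, 6, 4, 7, 5, 6, 3, 6]

t, s, df = pooled_t(treatment, control)
r = point_biserial_r(treatment, control)
n1, n2 = len(treatment), len(control)
g = (sum(treatment) / n1 - sum(control) / n2) / s

t_from_r = r / math.sqrt(1 - r ** 2) * math.sqrt(df)  # effect size x sqrt(df)
t_from_g = g * math.sqrt(n1 * n2 / (n1 + n2))         # effect size x size of study
f_from_r = r ** 2 / (1 - r ** 2) * df                 # squared version, F(1, df)

print(round(t, 4), round(t_from_r, 4), round(t_from_g, 4), round(f_from_r, 4))
```

All three routes to t agree to within rounding error, which is the sense in which a test of significance is the product of a size of effect and a size of study.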

Equation 2.9 is the generalization of equation 2.8 to the situation of df >
1 in the numerator. Thus $eta^2$ refers to the proportion of variance
accounted for just as $r^2$ does, but $eta^2$ carries no implication that the
relationship between the two variables in question is linear. Equation 2.10
shows the effect size for F as the ratio of the variance of the condition
means to the pooled within group variance, while the size of the study is
indexed by n, the number of observations in each of the groups. Because we
rarely employ fixed effect F tests with df > 1 in the numerator in
meta-analytic work, equations 2.9 and 2.10 are used infrequently in
summarizing domains of research.

I.A. Comparing r to d

Equation 2.11 has for its test of significance a t for correlated
observations or repeated measures. It is important to note that this equation
for the correlated t is identical to equation 2.3 (Table 2.1) for the
independent samples t. Thus when we employ r as our effect size estimate, we
need not make any special adjustment in moving from t tests for independent
to those for correlated observations. That is not the situation for equations
2.12 and 2.13, however. When the effect size estimates are the mean
differences divided either by S or by $\sigma$, the definition of the size of
the study changes by a factor of 2 in going from t for independent
observations to t for correlated observations. This inconsistency in
definitions of size of study is one of the reasons I have grown to prefer r
as an effect size estimate rather than d, after many years of using both r
and d.

Another reason for preferring r over d as an effect size estimate is that we
are often unable to compute d accurately from the information provided by
the author of the original article. Investigators sometimes report only their
t’s and df’s but not their sample sizes. Therefore, we cannot use equations
2.4, 2.5, or 2.6 to compute the effect sizes. We could do so only if we
assumed $n_1 = n_2$. If we did so, for example, from rearranging equation
2.7, we could get d as follows:

$$ d = \frac{2t}{\sqrt{df}} $$    [2.14]



If the investigator’s sample sizes were equal, d would be accurate, but as
$n_1$ and $n_2$ become more and more unequal, d will be more and more
underestimated. Table 2.3 shows for eight studies, all with t = 3.00 and
$df = n_1 + n_2 - 2 = 98$, the increasing underestimation of d when we assume
equal n’s and employ equation 2.14. It should be noted, however, that when
the split is no more extreme than 70:30 the underestimation is less than 8%.
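
The entries of Table 2.3 can be recomputed in a few lines. This is a sketch only; the function names are hypothetical, and the two formulas are the rearrangements of equations 2.6 and 2.7 given in the table notes.

```python
import math

def d_general(t, n1, n2):
    """d from t when both group sizes are known (rearranged equation 2.6)."""
    df = n1 + n2 - 2
    return t * (n1 + n2) / (math.sqrt(df) * math.sqrt(n1 * n2))

def d_equal_n(t, df):
    """d from t under the assumption n1 = n2 (equation 2.14)."""
    return 2 * t / math.sqrt(df)

t, df = 3.00, 98
splits = [(50, 50), (60, 40), (70, 30), (80, 20),
          (90, 10), (95, 5), (98, 2), (99, 1)]

rows = []
for n1, n2 in splits:
    accurate = d_general(t, n1, n2)
    estimated = d_equal_n(t, df)
    # Shortfall of the equal-n estimate as a proportion of the accurate d.
    shortfall = (accurate - estimated) / accurate
    rows.append((n1, n2, accurate, estimated, shortfall))
    print(f"{n1:3d} {n2:3d} {accurate:5.2f} {estimated:5.2f} {shortfall:5.2f}")
```

Because the shortfall reduces to $1 - 2\sqrt{n_1 n_2}/(n_1 + n_2)$, it depends only on how unequal the split is and not on t or on the total N.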

A third reason for preferring r to d as an effect size estimate has to do
with simplicity of interpretation in practical terms. In the final chapter of
this book we describe the BESD (binomial effect size display), a method for
displaying the practical importance of the size of an obtained effect. Using
this method we can immediately convert r to an improvement in success rate
associated, for example, with employing a new treatment procedure, a new
selection device, or a new predictor variable. Because of the probability of
seriously misinterpreting its practical importance (as discussed in
Chapter 7), we shall not use $r^2$ as an effect size estimate (Rosenthal &
Rubin, 1982c).

TABLE 2.3

Underestimation of d by “Equal n” Formula

                        Accurate   Estimated   Raw          Proportion
Study   $n_1$   $n_2$   d (a)      d (b)       Difference   Underestimated

1       50      50       .61        .61          .00          .00
2       60      40       .62        .61         -.01          .02
3       70      30       .66        .61         -.05          .08
4       80      20       .76        .61         -.15          .20
5       90      10      1.01        .61         -.40          .40
6       95       5      1.39        .61         -.78          .56
7       98       2      2.16        .61        -1.55          .72
8       99       1      3.05        .61        -2.44          .80

a. General formula from rearranging equation 2.6:
   $d = t(n_1 + n_2)/(\sqrt{df}\sqrt{n_1 n_2})$.

b. “Equal n” formula from rearranging equation 2.7: $d = 2t/\sqrt{df}$.

A final reason for preferring r over d is its greater flexibility; r can always 
be used whenever d can be but d cannot always be used whenever r can be. 
Sometimes, for example, the basic hypothesis is that there will be a particular 
ordering of means, e.g., 1, 3, 5, 7 with contrast weights of -3, -1, 1, 3 
(Rosenthal & Rosnow, 1985). In such a situation d is useless but r suits very 
well indeed. 

Although I have grown to prefer r over d for the reasons just given, the
most important point to be made is that some estimate of the size of the
effect should always be given whenever results are reported. Whether we
employ r, g, d, Glass’s $\Delta$ (difference between the means divided by
the S computed from the control group only) or any of the other effect size
estimates that could be employed (e.g., Cohen, 1977, 1988) is less important
than that some effect size estimate be employed along with the more
traditional test of significance.

I.B. Computing Effect Sizes

The emphasis in this book will be on r as the primary effect size estimate.
Since most investigators do not yet routinely provide effect size estimates
along with their tests of significance we must usually compute our own from
the tests of significance they have provided. The following formulas can be
found by rearranging equations 2.1, 2.3, and 2.8 (Cohen, 1965; Friedman,
1968):

$$ r = \phi = \sqrt{\frac{\chi^2(1)}{N}} $$    [2.15]

$$ r = \sqrt{\frac{t^2}{t^2 + df}} $$    [2.16]

where $df = n_1 + n_2 - 2$, and

$$ r = \sqrt{\frac{F(1, -)}{F(1, -) + df_{error}}} $$    [2.17]

where F(1, -) indicates any F with df = 1 in the numerator.

In case none of these tests of significance have been employed or
reported, we can usefully estimate an effect size r from a p level alone as
long as we know the size of the study (N). We convert the obtained p to its
standard normal deviate equivalent using a table of Z values. We then find r
from:

$$ r = \sqrt{\frac{Z^2}{N}} = \frac{Z}{\sqrt{N}} $$    [2.18]

It should be noted that equations 2.15 to 2.18 all yield product moment
correlation coefficients. It makes no difference whether the data are in
dichotomous or continuous form, or whether they are ranked. Thus correlations
known as Pearson’s r, Spearman’s rho, phi, or point biserial r, are all
defined in exactly the same way (though there are computational
simplifications available so that some appear to be different from others)
and are interpreted in exactly the same way.
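
As a hedged illustration, the conversions from a reported test of significance to r might be coded as below. The function names are hypothetical; the p-to-Z step uses the Python standard library's normal distribution in place of a printed table of Z values, and the one-tailed p is assumed to be in the predicted direction. The square-root forms return the magnitude of r only, so the sign must be attached from the direction of the result.

```python
import math
from statistics import NormalDist

def r_from_chi2(chi2, n):
    """r (phi) from a chi-square on 1 df and total N (equation 2.15)."""
    return math.sqrt(chi2 / n)

def r_from_t(t, df):
    """Magnitude of r from t and its df (equation 2.16)."""
    return math.sqrt(t ** 2 / (t ** 2 + df))

def r_from_f(f, df_error):
    """Magnitude of r from an F with 1 df in the numerator (equation 2.17)."""
    return math.sqrt(f / (f + df_error))

def r_from_z(z, n):
    """r from a standard normal deviate Z and total N (equation 2.18)."""
    return z / math.sqrt(n)

def r_from_p(p_one_tailed, n):
    """r from a one-tailed p level alone, by way of Z (equation 2.18)."""
    z = NormalDist().inv_cdf(1 - p_one_tailed)
    return r_from_z(z, n)

print(round(r_from_chi2(4.0, 100), 3))
print(round(r_from_t(2.76, 10), 3))
print(round(r_from_f(2.76 ** 2, 10), 3))
print(round(r_from_p(0.001, 100), 3))
```

The second and third calls give the same r because F with 1 df in the numerator is simply $t^2$.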

If we should want to have r as our effect size estimate when only Cohen’s
d is available we can readily go to r from d (Cohen, 1977):

$$ r = \frac{d}{\sqrt{d^2 + \frac{1}{pq}}} $$    [2.19]
where p is the proportion of the total population that is in the first of the
two groups being compared and q is the proportion in the second of the two
groups, or 1 - p. When p and q are equal, or when they can be viewed as
equal in principle, 1/(pq) = 4 and equation 2.19 is simplified to equation
2.20.

$$ r = \frac{d}{\sqrt{d^2 + 4}} $$    [2.20]


In most experimental applications we use equation 2.20 because we think of
equal population sizes in principle. We might prefer equation 2.19 in
situations where we have intrinsic inequality of population sizes as when we
compare the personal adjustment scores of a random sample of normals and
a random sample of hospitalized psychiatric patients.

In those cases where we want to work with Cohen’s d but have only r
available we can go from r to d:

$$ d = \frac{2r}{\sqrt{1 - r^2}} $$    [2.21]
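
A short sketch of the conversions between d and r follows. The function names and the example value of d are assumptions made for illustration; the conversion from r back to d is simply the inverse of the equal-proportions form, so it should return the original d only when p = q = .5.

```python
import math

def r_from_d(d, p=0.5):
    """r from Cohen's d; p is the population proportion in the first group."""
    q = 1 - p
    return d / math.sqrt(d ** 2 + 1 / (p * q))

def d_from_r(r):
    """Cohen's d from r, assuming equal population proportions."""
    return 2 * r / math.sqrt(1 - r ** 2)

d = 0.50                          # hypothetical effect size
r_equal = r_from_d(d)             # p = q = .5, so 1/(pq) = 4
r_unequal = r_from_d(d, p=0.9)    # intrinsically unequal populations

print(round(r_equal, 3), round(r_unequal, 3), round(d_from_r(r_equal), 3))
```

With unequal proportions the same d corresponds to a smaller r, since 1/(pq) exceeds 4 whenever p differs from .5.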




II. INFERENTIAL ERRORS 

If the reported results of a study always include both an estimate of effect
size and a test of significance (or a related procedure such as a confidence
interval) we can better protect ourselves against the inferential invalidity of
type I and type II errors. There is little doubt that in the social and behavioral
sciences type II errors (concluding that X and Y are unrelated when they
really are related) are far more likely than type I errors (Cohen, 1962, 1977,
1988). The frequency of type II errors can be reduced drastically by our
attention to the magnitude of the estimated effect size. If that estimate is large
and we find a nonsignificant result, we would do well to avoid deciding that
variables X and Y are not related. Only if the pooled results of a good many
replications point to both a very small effect size on the average and to a
combined test of significance that does not reach our favorite alpha level are
we justified in concluding that no nontrivial relationship exists between X
and Y. Table 2.4 summarizes inferential errors and some possible
consequences as a joint function of the results of significance testing and
the population effect size.
TABLE 2.4

Population Effect Sizes and Results of Significance Testing
as Determinants of Inferential Errors

                        Results of Significance Testing
Population
Effect Size             Not Significant        Significant

Zero                    No Error               Type I Error
Small                   Type II Error (a)      No Error (b)
Large                   Type II Error (c)      No Error

a. Low power may lead to failure to detect the true effect, but if the true
effect is quite small the costs of this error may not be too great.

b. Although not an inferential error, if the effect size is very small and N
is very large we may mistake a result that is merely very significant for one
that is of practical importance.

c. Low power may lead to failure to detect the true effect and with a
substantial true effect the costs of this error may be very great.


III. ADJUSTING EFFECT SIZE ESTIMATES 

III.A. The Fisher and the Hedges Adjustments

In this book our primary effect size estimator will be the correlation
coefficient r. However, as the population value of r gets further and further
from zero the distribution of r’s sampled from that population becomes
more and more skewed. This fact complicates the comparison and combination
of r’s, a complication addressed by Fisher (1928). He devised a
transformation ($z_r$) that is distributed nearly normally. In virtually all
the meta-analytic procedures we shall be discussing, whenever we are
interested in r we shall actually carry out most of our computations not on r
but on its transformation $z_r$. The relationship between r and $z_r$ is
given by:

$$ z_r = \tfrac{1}{2} \log_e \left[ \frac{1 + r}{1 - r} \right] $$    [2.22]


Fisher (1928, p. 172) noted that there was a small and often negligible
bias in $z_r$, each $z_r$ being too large by r-population/[2(N - 1)]. Only
when N is very small while at the same time the r-population (the actual
population value of r) is very substantial is the bias of any consequence.
For practical purposes, therefore, it can safely be ignored (Snedecor &
Cochran, 1989). Before leaving this introduction to $z_r$, it should be noted
that it would make a very serviceable effect size estimate but one not as
easily interpreted as r (see the final chapter).

There are analogous biases in other effect size estimates, such as Glass’s
$\Delta$, Hedges’s g, and Cohen’s d; Hedges (1981, 1982a) has provided both
exact and approximate correction factors. Hedges’s unbiased estimator $g_u$
is given by

$$ g_u = c(m)g $$    [2.23]

where g is the effect size estimate computed as $(M_1 - M_2)/S$ (with S
computed from both the experimental and the control groups) and c(m) is given
approximately by

$$ c(m) \approx 1 - \frac{3}{4m - 1} $$    [2.24]

where m is the df computed from both the experimental and control groups
or $n_1 + n_2 - 2$ (see also Hedges & Olkin, 1985).


III.A.1. Illustrating Fisher’s and Hedges’s adjustments. To illustrate
Fisher’s and Hedges’s methods of adjustment we assume an experiment in which
$n_1 = 4$ and $n_2 = 8$ with t(10) = 2.76. To illustrate Fisher’s method we
need r as our estimate of effect size. Equation 2.3 of Table 2.1 can be used
to obtain r by way of equation 2.16. For this example:

$$ r = \sqrt{\frac{(2.76)^2}{(2.76)^2 + 10}} = .658; \quad z_r = .789 $$

The bias to be corrected in $z_r$ is r-population divided by 2(N - 1). Of
course, we don’t know the r-population, but we can begin by employing the
obtained r as a first approximation. We therefore estimate the bias in $z_r$
as

$$ \text{estimated bias}_1 = \frac{.658}{2(12 - 1)} = .030 $$

This bias is to be removed from the obtained $z_r$ of .789 so our corrected
$z_r$ is

$$ .789 - .030 = .759 $$

which is associated with a corrected r of .640. Since we now have a more
accurate estimate of the population value of r (i.e., .640) we could repeat the
calculations to obtain a still more accurate correction for bias:

$$ \text{estimated bias}_2 = \frac{.640}{2(12 - 1)} = .029 $$

This corrected bias differs little from our first approximation and leads to a
corrected $z_r$ of

$$ .789 - .029 = .760 $$

which is associated with a corrected r of .641. Note that the corrected r
differs little from the uncorrected r (.658 versus .641) even though N was
quite small (12) and r-population was estimated to be quite substantial.

To illustrate Hedges’s method of correction for small sample bias we
need g as our estimate of effect size. Since g is defined as
$(M_1 - M_2)/S$ we can obtain g from equations 2.4 or 2.5 from Table 2.1 as

$$ g = t\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = (2.76)\sqrt{\frac{1}{4} + \frac{1}{8}} = 1.69 $$    [2.25]

or, equivalently, as

$$ g = t\,\frac{\sqrt{n_1 + n_2}}{\sqrt{n_1 n_2}} = (2.76)\frac{\sqrt{4 + 8}}{\sqrt{(4)(8)}} = 1.69 $$

To employ Hedges’s approximate correction we obtain $g_u$ as a function of
c(m) and g. For this example m = 4 + 8 - 2 = 10, so:

$$ c(m) \approx 1 - \frac{3}{4(m) - 1} = 1 - \frac{3}{39} = .9231 $$

$$ g_u = c(m)g = (.9231)1.69 = 1.56 $$

Table 2.5 summarizes Fisher’s and Hedges’s adjustments for the present
example. The reduction in effect size is greater for Hedges’s method than for
Fisher’s method but since the metric r and the metric g are not directly
comparable we must first find a common metric before we can interpret the
relative magnitude of the corrections made. A suitable common metric is
the t distribution on 10 df since both r and g can be expressed in terms of this
distribution. The lower half of Table 2.5 shows that, for the present
example, Hedges’s correction is more extreme than is Fisher’s correction, but
both corrections are less than 8% in units of the t(10) distribution.

If one should want to convert r to g it can be done as follows:

$$ g = \frac{r}{\sqrt{1 - r^2}} \times \sqrt{\frac{df(n_1 + n_2)}{n_1 n_2}} $$    [2.27]¹

If one should want to convert g to r it can be done as follows:

$$ r = \sqrt{\frac{g^2 n_1 n_2}{g^2 n_1 n_2 + (n_1 + n_2)df}} $$

TABLE 2.5

Fisher’s and Hedges’s Adjustments for Bias

                                       Effect Sizes
                                       r          g
Effect Size
  Original                             .658       1.69
  Corrected                            .641       1.56
  Difference                           .017        .13
  Percentage reduction                 2.6         7.7

Location on t(10) Distribution
  Original                             2.76       2.76
  Corrected                            2.64       2.55
  Difference                            .12        .21
  Percentage reduction                 4.3         7.6

III.B. The Hunter and Schmidt Adjustments 

The most elaborate set of adjustments has been proposed by Hunter and
Schmidt (1990; see also Hunter, Schmidt, & Jackson, 1982). They recommend
adjustment for unreliability of the independent and dependent variables,
dichotomization of continuous independent and dependent variables,
restriction of range of the independent and dependent variables, imperfection
of construct validity of the independent and dependent variables, and even
the employment of unequal sample sizes for the experimental and control
groups. The Hunter and Schmidt work is valuable for reminding us that there
are many sources of noise that may serve to lower obtained effect sizes. Their
work is also valuable for providing us with procedures for adjusting for these
sources of noise. The application of these procedures gives us some estimate
of what effect size we might expect to find in the best of all possible worlds.
That is a useful thing to know, perhaps as a goal to strive for by developing
better measures and better design procedures. However, it does not strike me
as the proper goal of a meta-analysis. That goal is to teach us better what is,
not what might some day be in the best of all possible worlds when all our
independent and dependent variables are perfectly measured, perfectly valid,
perfectly continuous, and perfectly unrestricted in range.

Even when these adjustments are made with the goal of setting some upper
limits of what better instrumentation and better design procedures might
yield in future research in an area, these adjustments must be applied with
great caution. It has been known for nearly a century, for example, that
correction for unreliability alone can yield “corrected” effect size
correlations greater than 1.00 (Guilford, 1954; Johnson, 1944; Spearman, 1910).
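
A small numerical sketch shows how such a result can arise. The formula used here is the classical correction for attenuation (the obtained r divided by the square root of the product of the two reliabilities); it is a standard textbook formula rather than one stated in this chapter, and the correlation and reliabilities are hypothetical values chosen only to make the point.

```python
import math

def correct_for_attenuation(r_xy, r_xx, r_yy):
    """Classical correction for attenuation: r_xy / sqrt(r_xx * r_yy).

    r_xx and r_yy are the reliabilities of the two measures.
    """
    return r_xy / math.sqrt(r_xx * r_yy)

# Hypothetical values: a sizable obtained r with two unreliable measures.
r_obtained = 0.60
r_corrected = correct_for_attenuation(r_obtained, r_xx=0.50, r_yy=0.60)

print(round(r_corrected, 3))
```

Whenever the obtained r exceeds the square root of the product of the reliabilities, as can happen by sampling error or with underestimated reliabilities, the "corrected" value falls outside the possible range of a correlation.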

III.C. The Glass, McGaw, and Smith Adjustments

Studies entering into a meta-analysis differ in the precision of the
statistical procedures employed in their analysis. Thus repeated measures
designs (of which gain score analyses are a special case), analysis of
covariance designs, and designs employing blocking will tend to produce
larger effect sizes and more significant test statistics than would the
analogous unblocked posttest only designs. Glass, McGaw, and Smith (1981)
have shown how we might convert the results of various designs onto a common
scale of effect size (e.g., $\Delta$ or g) based on the unblocked posttest
only. These adjustments cannot always be made for the results of other
people’s studies, but can often be quite usefully employed. However, when
they are employed, I recommend that both the adjusted and unadjusted
statistics be reported.

Just as repeated measures, covariance, and blocking designs tend to
increase power, the use of nonparametric tests of significance may tend to
decrease power, and Glass et al. (1981) provide adjustment procedures. As
in the case of adjustments noted earlier, I recommend reporting the
unadjusted statistics along with those that have been adjusted. When
nonparametric tests have been employed, a useful estimate of effect size (r)
can be obtained from looking up the standard normal deviate (Z) associated
with the accurately determined p level and finding r from

$$ r = \sqrt{\frac{Z^2}{N}} = \frac{Z}{\sqrt{N}} $$    [2.18]

An alternative procedure is to find the t(df) that is equivalent to the obtained
p and employ

$$ r = \sqrt{\frac{t^2}{t^2 + df}} $$    [2.16]

In the past, the accuracy of these procedures has been limited by the
structure of tables of t and of Z, which rarely gave p’s much below .0001.
However, inexpensive hand-held calculators are now available that permit
working with p’s as low as $1/10^{500}$, Z’s as large as 47.8, and t’s
(e.g., for df = 10) of $10^{50}$.

IV. SOME SOLUTIONS TO THE PROBLEM 
OF MULTIPLE (CORRELATED) RESULTS 

Many of the studies entering into our meta-analyses will have more than
one test of significance relevant to our hypothesis and, since for every test of
significance there is an effect size estimate, these studies will have more
than one effect size estimate as well. The various dependent variables
employed in a study should all be examined for clues as to the types of
dependent variable that seem most affected and least affected by the
independent variable of interest. If there are many studies using several of
the same dependent variables one could perform a separate meta-analysis for
each different type of dependent variable involved. For example, if one were
studying the effects of alcoholism treatment programs, separate analyses
could be performed for the dependent variables of sobriety, number of days
of employment, number of arrests, general medical health, personal and
social adjustment, and so on. Each of these types of dependent variable
could be operationalized in several ways. For example, for each of them we
could obtain self-reports, family reports, and institutional reports (e.g.,
from hospitals, clinics, courts, police departments, etc.).

Table 2.6 shows a matrix of 6 types of dependent variables crossed by 3
sources of information. If there were a set of studies that had employed all of
the 6 × 3 = 18 specific dependent variables, we could perform a separate
meta-analysis on each of the 6 types of variables averaged across all 3
sources of information to learn which variable, on the average, was most
affected by the treatment. We could also perform a separate meta-analysis
on each of the 3 sources of information averaged across all 6 types of
variables to learn which of the sources was most affected by the treatment.
We could examine these matters simultaneously for a set of K studies by
entering effect sizes (or Z’s associated with significance levels) into each of
the 6 × 3 = 18 cells of the matrix and then conducting a K × 6 × 3 analysis
of variance on the effect sizes (or the Z’s). In such an analysis there would be
K independent sampling units (studies) and repeated measures on the 6
level factor of variable type and on the 3 level factor of information source.
Such an analysis would be of great value for the simultaneous light it might
shed on the effects of variable type, information source, and the interaction
of these variables on the magnitude of the experimental effects obtained.
TABLE 2.6

Matrix of Hypothetical Dependent Variables Obtained in a
Set of Studies of Alcoholism Treatment Programs

                                Source of Information
Type of                 Self-       Family      Institutional
Variable                Report      Report      Report           Mean

Days of sobriety
Days of employment
Number of arrests
Medical health
Personal adjustment
Social adjustment

Mean

Unfortunately, we do not often encounter such nicely filled-in matrices 
of effect sizes. Indeed, we count ourselves fortunate when even a substantial 
subset of studies have employed the same types of variables. Assuming the 
typical situation, then, how are we to analyze multiple results from a single 
study? Shall we count each result from a different dependent variable as 
though it were a separate study, i.e., as though it were an independent 
result? Smith et al. (1980) and Glass et al. (1981) have treated multiple 
results as though they were independent, a practice for which they have been 
unjustifiably criticized. Where have their critics gone wrong? They have 
confused the effect of nonindependence on significance testing with its effect 
on effect size estimation. Treating nonindependent results as independent 
does tend to create errors in significance testing, but Smith et al. and Glass 
et al. did not do significance testing. Treating nonindependent results as 
independent for purposes of effect size estimation simply weights each study 
in proportion to the number of different effect sizes it generates. Although 
not all meta-analysts may wish to employ such weighting, there is certainly 
nothing wrong with doing so. 

My own recommendation is to have each study contribute only a single 
effect size estimate and a single significance level to the overall analysis. 
That recommendation does not preclude computing additional overall effect 
size estimates in which each study is weighted by the number of research 
results it yields, by its sample size, by its quality, or by any other reasonable 
weighting factor. 

In the following sections some procedures are proposed that can be used 
to obtain a single research result from a set of correlated research results. 
We begin by describing procedures applicable in the usual meta-analytic 
situation in which we are given relatively few details in the report available 
to us. Subsequently we describe procedures applicable when more of the 
original data are available to us.

In most of these applications we will find that significance levels and
effect sizes are highly correlated. That follows from the fact that most
correlated results will be based on approximately the same sample size. When
that is the case there tends to be a perfect monotonic relationship between 
significance level and effect size. 

IV.A. Original Data or Intercorrelations 
Among Dependent Variables Not Available 

IV.A.1. Method of mean result. Perhaps the most obvious method of
obtaining a single result for a set of results from a single study is to calculate
the mean level of significance and the mean effect size. Suppose we have a
set of three one-tailed p levels: .25, .10, and .001. To average these p’s we
first find the standard normal deviate (Z) corresponding to each, and
average these Z’s. (If results are simply reported as nonsignificant, and we have
no further information available, we have no choice but to assume a p level
of .50, or a Z of 0.00.) In this example, our three Z’s are .67, 1.28, and 3.09.
All three Z’s have positive signs because all results were in the same
direction. The mean of our three Z’s is [.67 + 1.28 + 3.09]/3 = 5.04/3 = 1.68,
a Z corresponding to a p of .046. It should be emphasized that when we
average p levels it is their associated Z’s we average and not the p levels
themselves. This is discussed in detail in Chapters 4 and 5.
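The averaging of p levels through their Z's can be sketched in a few lines of Python. This is only an illustration using the standard library; the helper names `p_to_z` and `z_to_p` are our own, and all three results are assumed to be in the same direction (a result in the opposite direction would need a negative Z).

```python
from statistics import NormalDist, mean

STD_NORMAL = NormalDist()  # standard normal: mean 0, sd 1


def p_to_z(p_one_tailed):
    """Standard normal deviate Z for a one-tailed p (positive when p < .50)."""
    return STD_NORMAL.inv_cdf(1.0 - p_one_tailed)


def z_to_p(z):
    """One-tailed p level for a standard normal deviate Z."""
    return 1.0 - STD_NORMAL.cdf(z)


p_levels = [0.25, 0.10, 0.001]           # the three one-tailed p's from the text
z_values = [p_to_z(p) for p in p_levels]
mean_z = mean(z_values)                  # average the Z's, not the p's
mean_p = z_to_p(mean_z)

print([round(z, 2) for z in z_values])
print(round(mean_z, 2), round(mean_p, 3))
```

Averaging the three p levels directly would give about .117, a noticeably different answer, which is why the Z's are averaged instead.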

To average several effect size estimates we simply take their mean if they
are already in standard deviation units as in Cohen’s d, Glass’s Δ, and
Hedges’s g. In the case of r, we first transform each r value to $z_r$ before
finding the mean. If effect sizes have not been given we can compute our
own, one for each p level, as long as we know N, the number of sampling
units, since

$$ r = \frac{Z}{\sqrt{N}} $$

For the preceding three p levels we would find corresponding r’s of .067,
.128, and .309 if N = 100. The $z_r$’s associated with these r’s are found to be
.07, .13, and .32, yielding a mean $z_r$ of .17 corresponding to an r of .17.
When r’s are all quite low, averaging directly yields results essentially like
those obtained when we first transform the r’s to $z_r$’s. For the present
example, direct averaging of r’s also yields a mean r of .17.
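As a minimal sketch of this computation, the following Python converts each Z to r (r = Z divided by the square root of N) and averages the r's through Fisher's z_r transformation. The function names are our own, and `math.atanh` and `math.tanh` are used as the z_r transformation and its inverse.

```python
import math
from statistics import mean


def r_from_z(z, n):
    """Effect size r implied by a standard normal deviate Z and N sampling units."""
    return z / math.sqrt(n)


def mean_r_via_fisher(rs):
    """Average r's by transforming to Fisher z_r, averaging, and back-transforming."""
    return math.tanh(mean(math.atanh(r) for r in rs))


N = 100
z_values = [0.67, 1.28, 3.09]             # Z's from the running example
rs = [r_from_z(z, N) for z in z_values]   # one r for each p level
fisher_zs = [math.atanh(r) for r in rs]   # z_r for each r

print([round(r, 3) for r in rs])
print([round(zr, 2) for zr in fisher_zs])
print(round(mean_r_via_fisher(rs), 2), round(mean(rs), 2))
```

With r's this small the two averages agree to two decimals; the difference grows as the r's become larger.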

An alternative procedure is to compute the mean p level and then simply
compute the effect size corresponding to it. Although the two estimates
will often yield similar values, it should be noted that the mean of a set of
effect sizes, each based on an associated p, is a different statistic than is the
effect size associated with the mean p level. For example, imagine two p
levels from the same study (N = 100) associated with standard normal
deviates of 0.00 and 9.00. Their mean is a Z of 4.5. The effect size r associated
with this mean Z is

$$ r = \frac{4.5}{\sqrt{100}} = .45 $$

However, the two effect sizes associated with Z’s of 0.00 and 9.00 are r’s of
.00 and .90 but $z_r$’s of 0.00 and 1.47, respectively. The mean of these $z_r$’s is
about .74 corresponding to an r of .63. Clearly the two methods can yield
quite different results (.63 versus .45), neither of which is intrinsically more
correct than the other. A reasonable practice is to decide beforehand on one
of these procedures and use it throughout any given meta-analysis. In no
case, however, should both procedures be employed unless both are
reported. In other words, it will not do to compute both estimates and then
report or use in the meta-analysis only the personally preferred estimate.
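The divergence between the two procedures is easy to verify. The sketch below, with variable names of our own choosing, computes both estimates for the two Z's of the example.

```python
import math
from statistics import mean

N = 100
z_values = [0.00, 9.00]   # the two standard normal deviates from the text

# Procedure 1: average the Z's, then convert the mean Z to r.
r_from_mean_z = mean(z_values) / math.sqrt(N)

# Procedure 2: convert each Z to r, average in Fisher z_r units, back-transform.
rs = [z / math.sqrt(N) for z in z_values]
mean_zr = mean(math.atanh(r) for r in rs)
r_from_mean_zr = math.tanh(mean_zr)

print(round(r_from_mean_z, 2))                        # effect size of the mean Z
print(round(mean_zr, 2), round(r_from_mean_zr, 2))    # mean z_r and its r
```

Whichever procedure is chosen, the choice belongs in the analysis plan before either number has been seen.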

Sometimes only one or two p levels are reported when a whole array of
effect sizes is available. That might happen, for example, if the report
provides a correlation matrix in which the dummy-coded (0,1) independent
variable is correlated with a whole series of dependent variables. In this
case, of course, we would base our effect size estimate on the mean of all the
effect sizes, not just those for which p levels are reported. In addition, we
base our p level estimate on the mean of all the p levels associated with all
the effect sizes reported. By rearrangement from equation 2.18 we can get
the Z associated with each r from the following:

$$ Z = r\sqrt{N} $$

An alternative to computing the mean of all the Z’s obtained from this
procedure is to compute only the Z associated with the mean effect size.
The cautions given above about the process of choosing one of these
procedures to present as “the result” should be kept in mind.

IV.A.2. Method of median result. When the p levels and/or the effect
sizes produced by a single study are very skewed, some meta-analysts may
prefer to compute the median p level and the median effect size. Although
there are a great many statistical applications where medians are to be
preferred to means (Tukey, 1977), the use of medians in meta-analytic work
tends to give results consistently favoring type II errors, i.e., results leading
to estimates favoring the null hypothesis. An intuitively clear example
might be the following five p levels, all one-tailed: .25, .18, .16, .001, and
.00003. The median p of .16 is notably larger than the mean p of .027
associated with the mean Z of 1.93. Intuition may suggest that the mean is a
better estimate of the gist of the five p levels than is the median, given two
such very significant results in the set of five correlated results. That
intuition will be supported by the logic of the Bonferroni-based methods to be
discussed next.
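The contrast between the median p and the mean p for these five results can be checked with a short sketch such as the following, in which the mean p is obtained through the mean Z as described earlier.

```python
from statistics import NormalDist, mean, median

STD_NORMAL = NormalDist()

p_levels = [0.25, 0.18, 0.16, 0.001, 0.00003]   # one-tailed p's from the text

median_p = median(p_levels)
mean_z = mean(STD_NORMAL.inv_cdf(1.0 - p) for p in p_levels)
mean_p = 1.0 - STD_NORMAL.cdf(mean_z)

print(median_p)                             # the median p level
print(round(mean_z, 2), round(mean_p, 3))   # mean Z and its one-tailed p
```

Because the Z's here are computed without intermediate rounding, the mean p may differ from the value in the text in the third decimal place.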

IV.A.3. Method of ensemble-adjustment of p. Suppose we had four p
levels for a given study: .50, .50, .50, and .001. The median p is .50 and the
mean p is .22. But somehow we think that, of four results, we should not
find a p as low as .001 if the null hypothesis were true. The Bonferroni-based
procedures address such issues. One can examine a set of R correlated
results for a single study, compute the most significant p, and calculate the
conservative corrected p that this most significant found p could have been
obtained, after examining R results, if the null hypothesis were true
(Rosenthal & Rubin, 1983). All that needs to be done is to multiply the most
significant p ($p_{ms}$) by R, the number of p levels that were examined to find the most
significant p. Thus, for our example of R = 4 p levels where the most
significant p was .001, the ensemble-adjusted p value is

$$ p_{\text{adjusted}} = (R)\,p_{ms} = (4)(.001) = .004 $$


This procedure for correlated results, which is related to Tippett’s (1931)
procedure applied to independent samples, is employed when we have no
theoretical reasons to expect certain results to be more significant than
others. When we have theoretical reasons to expect some results to be more
significant than others, we can increase our power by assigning weights to
each of the results to correspond to our view of their importance. (The
actual assignment of weights must be done by an investigator blind to the
results obtained.) For example, suppose we knew a study to yield four p
levels. Before examining the results we decided that the first result was of
greatest importance with weight 5, the second and third results were of less
importance with weights of 2 each, and the fourth result was of least
importance with weight 1. Suppose we obtained one-tailed p’s of .02, .19, .24, and
.40, respectively. Then the weighted adjusted p level for the most
significant result would be given by

$$ p_{\text{adjusted, weighted}} = \left(\frac{\sum \text{weights}}{\text{weight of } p_{ms}}\right) p_{ms} = \left(\frac{5 + 2 + 2 + 1}{5}\right)(.02) = .04 $$

Therefore, the adjusted, weighted p is significant at p < .05, whereas the
unweighted adjusted p value would not have been, since

$$ p_{\text{adjusted}} = (R)\,p_{ms} = (4)(.02) = .08. $$

In addition, the mean p would have been .17, and the median p would have
been .21. Further details on assigning weights in Bonferroni-based
procedures are given in Rosenthal and Rubin (1984). Once we have computed our
ensemble-adjusted p we compute the associated effect size from equation
2.18.
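Both the unweighted and the weighted adjustment can be sketched with one small function. The function name is our own, and the cap at 1.0 is an addition made here (a p level cannot exceed 1); the sketch simply applies the multiplication described above to the smallest p in the set.

```python
def ensemble_adjusted_p(p_levels, weights=None):
    """Bonferroni-style adjustment of the most significant of R correlated p's.

    Unweighted: p_adjusted = R * p_ms.
    Weighted:   p_adjusted = (sum of weights / weight of p_ms) * p_ms,
    with the weights fixed before the results are seen.
    The result is capped at 1.0 (an assumption added in this sketch).
    """
    if weights is None:
        weights = [1.0] * len(p_levels)   # equal weights reduce to R * p_ms
    i_ms = min(range(len(p_levels)), key=lambda i: p_levels[i])
    adjusted = (sum(weights) / weights[i_ms]) * p_levels[i_ms]
    return min(1.0, adjusted)


# First example: four p levels, no a priori weights.
print(ensemble_adjusted_p([0.50, 0.50, 0.50, 0.001]))

# Second example: weights 5, 2, 2, 1 assigned before seeing the results.
p_levels = [0.02, 0.19, 0.24, 0.40]
print(ensemble_adjusted_p(p_levels, weights=[5, 2, 2, 1]))
print(ensemble_adjusted_p(p_levels))
```

The weighted version rewards a correct a priori prediction: the same p of .02 is adjusted to .04 rather than .08 because half of the total weight was committed to that result in advance.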

IV.B. Original Data or Intercorrelations 
Among Dependent Variables Available 

IV.B.1. Creating a single composite variable. When we have access to
the original data, an examination of the intercorrelations of our dependent
variables may suggest that all our dependent variables are substantially
intercorrelated. If that is the case we may want to create a composite variable
made up of all our dependent variables. One easy way to do so is to
standard-score (z score) each of our dependent variables and form the
composite variable as the mean of the z scores earned on the contributing
variables. This procedure weights the variables equally. If we have a priori
theoretical reasons for weighting some variables more heavily than others
we can do so. Any variable in its z score transformation can be multiplied by
any weight $w_i$ we like. Any subject’s score on the composite variable $\bar{z}_w$ is
then defined by the sum of that subject’s z scores, each multiplied by its
weight, and this sum divided by the sum of all the weights employed, or

$$ \bar{z}_w = \frac{\sum (w_i z_i)}{\sum w_i} $$

where $\bar{z}_w$ is the mean weighted z score or composite variable score for any
one subject, and $w_i$ is the weight given to the ith z score ($z_i$).

As an example, imagine a subject whose z scores on four dependent
variables are 1.10, .66, 1.28, and .92. The weights ($w_i$) assigned to each of these
variables were decided on a priori theoretical grounds to be 4, 2, 1, and 1,
respectively. Therefore, employing equation 2.31, our subject’s composite
variable score ($\bar{z}_w$) would be:

$$ \bar{z}_w = \frac{\sum (w_i z_i)}{\sum w_i} = \frac{(4)1.10 + (2).66 + (1)1.28 + (1).92}{4 + 2 + 1 + 1} = \frac{7.92}{8} = .99 $$
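The weighted composite score for one subject amounts to a weighted mean of that subject's z scores. A minimal sketch, with a function name of our own, follows.

```python
def weighted_composite(z_scores, weights):
    """Weighted mean of one subject's z scores: sum(w_i * z_i) / sum(w_i)."""
    if len(z_scores) != len(weights):
        raise ValueError("need exactly one weight per z score")
    return sum(w * z for w, z in zip(weights, z_scores)) / sum(weights)


z_scores = [1.10, 0.66, 1.28, 0.92]   # the subject's four z scores
weights = [4, 2, 1, 1]                # a priori weights from the text

composite = weighted_composite(z_scores, weights)
equal_weight_composite = weighted_composite(z_scores, [1, 1, 1, 1])

print(round(composite, 2), round(equal_weight_composite, 2))
```

With equal weights the function returns the simple mean of the z scores, which is the unweighted composite described at the start of this section.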

If we want an estimate of the internal consistency reliability of the
composite variable, we can obtain it by one of three ways: (1) applying the
Spearman-Brown formula to the mean of the intercorrelations among the
constituent variables, (2) computing an intraclass correlation following an
analysis of variance in which constituent variables become a repeated
measures factor, or (3) computing Armor’s theta from the unrotated first principal
component. All three of these procedures are described in some detail in the
following chapter and are summarized elsewhere (Rosenthal, 1982a, 1987a).

An alternative to combining variables by z scoring is to combine the raw 
scores. This is a reasonable alternative only when the standard deviations of 
each of the constituent variables are similar. If they are not, the variables 
with larger variances dominate the others in the composite variable, usually 
for no good theoretical reason. For example, one variable, ability test score,
may have σ = 20 while another variable, acceptance into a particular college
(scored as 1 or 0) may have σ = 0.50. Adding raw scores from these variables
would yield a new variable that was very little affected by the second variable 
(acceptance). 

Situations in which direct adding of variables often is useful include those 
in which the variables are ratings by others on a specific rating scale, scores 
on subtests of personality tests or of tests of cognitive functioning. In no case,
however, should variables in raw score form be combined without an
examination of the standard deviations of the variables. If the ratio of the largest
to the smallest σ is 1.5 or less, combining is safe. A larger ratio than 1.5 may
be tolerated when the number of variables to be combined grows larger. 
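The check on standard deviations is simple enough to automate before any raw scores are summed. In the sketch below the function names and the second set of standard deviations are hypothetical; only the 1.5 guideline and the σ's of 20 and 0.50 come from the text.

```python
def sd_ratio(sds):
    """Ratio of the largest to the smallest standard deviation."""
    return max(sds) / min(sds)


def raw_sum_is_safe(sds, max_ratio=1.5):
    """True when the largest/smallest sigma ratio is within the guideline."""
    return sd_ratio(sds) <= max_ratio


# Ability test score (sigma = 20) and college acceptance (sigma = 0.50).
print(sd_ratio([20.0, 0.50]), raw_sum_is_safe([20.0, 0.50]))

# Hypothetical rating scales with similar spreads.
print(round(sd_ratio([1.0, 1.2, 1.4]), 2), raw_sum_is_safe([1.0, 1.2, 1.4]))
```

The `max_ratio` argument is left adjustable because, as noted above, a somewhat larger ratio may be tolerated as the number of variables grows.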

IV.B.2. Creating a single estimate. A procedure has recently been
described that allows us to combine effect sizes for multiple dependent
variables knowing only the df and the typical intercorrelation among the
dependent variables. The illustration of this procedure employs the effect size
Cohen’s d (the difference between the means of the experimental and control
groups divided by the pooled σ). For the general case, and for technical
details, the paper by Rosenthal and Rubin (1986) should be consulted. We
obtain $d_c$, the composite effect size, from

$$ d_c = \frac{\sum \lambda_i t_i \,/\, [(n - 1)/2]^{1/2}}{[\rho(\sum \lambda_i)^2 + (1 - \rho)\sum \lambda_i^2]^{1/2}} $$

where $t_i$ is the t test of the significance of the effect of the treatment on the
ith dependent variable, $\lambda_i$ is the weight we assign (before seeing the data) to
the importance of the ith dependent variable, ρ is the typical intercorrelation
among the dependent variables, and n is the number of sampling units (e.g.,
subjects) in each group, or if these n’s are unequal, the harmonic mean ($n_h$)
of the two unequal sample sizes, or $n_h = 2n_1 n_2 / (n_1 + n_2)$.

For our illustration assume an experiment with equal n (so $n_1 = n_2 = 6$)
and with three dependent variables yielding three t(10)’s of .70, 1.37, and
4.14, with the average intercorrelation (ρ) among the dependent variables of
.50, and with the variables given equal weight so that $\lambda_1 = \lambda_2 = \lambda_3 = 1$. Then

$$ d_c = \frac{[(1).70 + (1)1.37 + (1)4.14] \,/\, [(6 - 1)/2]^{1/2}}{[.50(1 + 1 + 1)^2 + (1 - .50)(1^2 + 1^2 + 1^2)]^{1/2}} = \frac{3.9275}{2.4495} = 1.60 $$


Should we want to express $d_c$ in terms of its equivalent effect size estimate
r we can do so from

$$ r = \frac{d_c}{\sqrt{d_c^2 + 4}} $$

For our example, then

$$ r = \frac{1.60}{\sqrt{(1.60)^2 + 4}} = .62 $$
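A sketch of the composite effect size and its conversion to r follows. The function and argument names are our own; `weights` holds the a priori weights, `rho` the typical intercorrelation, and `n` the per-group sample size (or the harmonic mean of unequal n's).

```python
import math


def composite_d(ts, weights, rho, n):
    """Composite effect size d_c from t tests on correlated dependent variables."""
    numerator = sum(w * t for w, t in zip(weights, ts)) / math.sqrt((n - 1) / 2)
    denominator = math.sqrt(
        rho * sum(weights) ** 2 + (1 - rho) * sum(w * w for w in weights)
    )
    return numerator / denominator


def d_to_r(d):
    """Effect size r equivalent to d: r = d / sqrt(d**2 + 4)."""
    return d / math.sqrt(d * d + 4)


ts = [0.70, 1.37, 4.14]   # three t(10) values from the illustration
weights = [1, 1, 1]       # equal a priori weights
d_c = composite_d(ts, weights, rho=0.50, n=6)

print(round(d_c, 2))
print(round(d_to_r(1.60), 2), round(d_to_r(d_c), 2))
```

The second line prints r twice: once from d_c rounded to 1.60, as in the worked example, and once from the unrounded d_c. The two can differ in the second decimal place, so it is worth carrying full precision until the final step.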


We can test the significance of our composite effect size estimate by means
of the following:

$$ t_c = \frac{\sum \lambda_i t_i}{[\rho(\sum \lambda_i)^2 + (1 - \rho)\sum \lambda_i^2 + (1 - \rho^2)\sum \lambda_i^2 t_i^2 / 2df]^{1/2}} $$

For our example, then

$$ t_c = \frac{(1)(.70) + (1)(1.37) + (1)(4.14)}{\{.50(1 + 1 + 1)^2 + (1 - .50)(1^2 + 1^2 + 1^2) + (1 - .50^2)[1^2(.70)^2 + 1^2(1.37)^2 + 1^2(4.14)^2] / 2(10)\}^{1/2}} $$

$$ = \frac{6.21}{[(.50)(9) + (.50)(3) + (.75)(19.5065)/20]^{1/2}} = \frac{6.21}{2.5945} = 2.39 $$

with df = 10 and p < .02, one-tailed. (For procedures for comparing and
combining multiple significance levels rather than effect sizes, see Strube,
1985.)
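The significance test of the composite can be sketched in the same way. The function below follows the formula for t_c given above, with names of our own; it returns only the t statistic, since the Python standard library has no t distribution from which to obtain the p level.

```python
import math


def composite_t(ts, weights, rho, df):
    """Test of significance t_c for a composite of correlated dependent variables."""
    numerator = sum(w * t for w, t in zip(weights, ts))
    sum_w = sum(weights)
    sum_w_sq = sum(w * w for w in weights)
    sum_w_sq_t_sq = sum((w * t) ** 2 for w, t in zip(weights, ts))
    denominator = math.sqrt(
        rho * sum_w ** 2
        + (1 - rho) * sum_w_sq
        + (1 - rho ** 2) * sum_w_sq_t_sq / (2 * df)
    )
    return numerator / denominator


ts = [0.70, 1.37, 4.14]   # three t(10) values from the illustration
t_c = composite_t(ts, weights=[1, 1, 1], rho=0.50, df=10)

print(round(t_c, 2))      # refer to a t table with df = 10
```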

If there should be a theoretical interest in computing contrasts among the 
effect sizes associated with the correlated dependent variables, procedures 
are given in Rosenthal and Rubin (1986) for estimating the effect size and 
significance level of any such contrast. 


V. A SUMMARY OF SOME EFFECT SIZE INDICATORS 

In this section we want to bring together the various effect size indicators
that have been referred to as well as a few others that may prove useful.
Table 2.7 serves as a summary. The first four indicators include the very
general Pearson product moment correlation (r) and three related indices.
The indicator r/k is not typically employed as an effect size estimate though
it certainly could be. It is included here because of its role in equations 2.3
and 2.11; that is, it is an effect size estimate that needs only to be multiplied
by √df to yield the associated test of significance, t. The index r/k turns out
also to be related to Cohen’s d in an interesting way: it equals d/2 for
situations in which we can think of the two populations being compared as
equally numerous (Cohen, 1977; Friedman, 1968). The indicator $z_r$ is also
not typically employed as an effect size estimate though it, too, could be.
However, it is frequently used as a transformation of r in a variety of
meta-analytic procedures. Cohen’s q indexes the difference between two
correlation coefficients in units of $z_r$.

The next three indicators of Table 2.7 are all standardized mean
differences. They differ from each other only in the standardizing denominator.
Cohen’s d employs the σ computed from both groups employing N rather
than N − 1 as the within group divisor for the sums of squares. Glass’s Δ and
Hedges’s g both employ N − 1 divisors for sums of squares. Glass, however,
computes S only for the control group, while Hedges computes S from both
experimental and control groups.
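The three standardized mean differences can be compared side by side in a short sketch. The function names and the two small data sets are hypothetical; the denominators follow the verbal definitions just given (N versus N − 1 divisors, and control group only versus both groups).

```python
import math
from statistics import mean


def sum_of_squares(xs):
    """Sum of squared deviations from the group mean."""
    m = mean(xs)
    return sum((x - m) ** 2 for x in xs)


def cohens_d(exp, ctl):
    """Pooled sigma uses N as the within-group divisor."""
    sigma = math.sqrt((sum_of_squares(exp) + sum_of_squares(ctl)) / (len(exp) + len(ctl)))
    return (mean(exp) - mean(ctl)) / sigma


def hedges_g(exp, ctl):
    """Pooled S uses N - 1 divisors (pooled df = n1 + n2 - 2)."""
    s = math.sqrt((sum_of_squares(exp) + sum_of_squares(ctl)) / (len(exp) + len(ctl) - 2))
    return (mean(exp) - mean(ctl)) / s


def glass_delta(exp, ctl):
    """S is computed from the control group only, with an N - 1 divisor."""
    s = math.sqrt(sum_of_squares(ctl) / (len(ctl) - 1))
    return (mean(exp) - mean(ctl)) / s


experimental = [6, 8, 10, 12]   # hypothetical scores
control = [4, 6, 8, 10]         # hypothetical scores

d = cohens_d(experimental, control)
g = hedges_g(experimental, control)
delta = glass_delta(experimental, control)
print(round(d, 3), round(g, 3), round(delta, 3))
```

In these hypothetical data the two groups have equal variances, so g and Δ coincide; d is larger than g by the factor √[N/(N − 2)], which shrinks toward 1 as samples grow.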

The last three indicators of Table 2.7 include two from Cohen (1977). 
Cohen’s g is the difference between an obtained proportion and a proportion 
of .50. The index d' is the difference between two obtained proportions. 
Cohen’s h is also the difference between two obtained proportions but only 
after the proportions have been transformed to angles (measured in units 
called radians, equal to about 57.3 degrees). 
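The proportion-based indices, together with Cohen's q from the earlier paragraph, can be sketched as follows. The function names and the example proportions are our own; the arcsine transformation used for h, 2·arcsin(√p), is the one given by Cohen (1977), and q is taken as the difference between two Fisher z_r values.

```python
import math


def cohens_g(p):
    """Difference between an obtained proportion and .50."""
    return p - 0.50


def d_prime(p1, p2):
    """Difference between two obtained proportions."""
    return p1 - p2


def cohens_h(p1, p2):
    """Difference between two proportions after transforming each to an angle
    in radians, 2 * arcsin(sqrt(p))."""
    return 2 * math.asin(math.sqrt(p1)) - 2 * math.asin(math.sqrt(p2))


def cohens_q(r1, r2):
    """Difference between two correlations in units of Fisher's z_r."""
    return math.atanh(r1) - math.atanh(r2)


# Hypothetical proportions and correlations, for illustration only.
print(round(cohens_g(0.65), 2))
print(round(d_prime(0.65, 0.45), 2), round(cohens_h(0.65, 0.45), 3))
print(round(cohens_q(0.50, 0.30), 3))
```

Unlike d', the index h gives the same value to equally detectable differences wherever the two proportions fall between 0 and 1, which is the reason for the transformation.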

Many other effect size indicators could have been listed. For example, 
Kraemer and Andrews (1982) and Krauth (1983) have described effect size 
estimates when medians rather than means are to be compared. These are 











Retrieving and Assessing 
Research Results 



Procedures for locating and abstracting research results are described and illustrated, 
and the reliability of these procedures is discussed. Various types of errors, their
prevention and correction are described. Finally, the evaluation of the quality of research results
is discussed. 


There is in principle no difference between the conscientious review of a 
research area conducted traditionally or meta-analytically. In both cases one 
wants to find all the research results. There may be logistic and financial 
reasons for restricting a review simply to published works, but there are no 
scholarly reasons for doing so if our goal is to summarize the research
evidence bearing on a given relationship. After we have retrieved all the
retrievable research results, we will want to evaluate whether the sources of
our results are significantly and substantially related to the quality of the 
research conducted and the magnitude of the effects obtained. If they are, 
we can present our meta-analytic results separately for the various sources 
of information and the various levels of quality of the research conducted. 

I. RETRIEVING RESEARCH RESULTS
I.A. Locating Research Results

Locating research results has become a more sophisticated enterprise
than spending a few hours with the Psychological Abstracts, Sociological
Abstracts, Child Development Abstracts, Language and Language Behavior
Abstracts, or the International Bibliography of Social and Cultural
Anthropology. Computer-based retrieval systems are not only available but are
being enlarged and improved at a rate so rapid that few social scientists can be
truly expert in the methods of information retrieval. That is the domain of
the information specialist. To give the details required for the serious
retrieval of the results of a research area there is a useful paper on information
retrieval especially prepared for meta-analysts by an experienced reference
librarian (M. Rosenthal, 1985).

When the resources described in that paper have been properly
employed, including an examination of the references of the retrieved
documents and correspondence with the contributors to a research area to obtain
their unpublished manuscripts and their suggestions as to the location of 
other unpublished works, we will find four major classes of documents: (1) 
Books, including authored books, edited books, and chapters in edited 
books; (2) Journals, including professional journals, published newsletters, 
magazines, and newspapers; (3) Theses, including doctoral, master’s, and 
bachelor’s theses; (4) Unpublished work, including technical reports, grant 
proposals, grant reports, convention papers not published in proceedings, 
ERIC reports, films, cassette recordings, and other unpublished materials. 

I.A.1. Reliability of information sources. The purpose of this section is
to present the results of an analysis showing that, for a sample of
meta-analyses, there is a high degree of reliability among the four types of
documents in the average effect size obtained. The raw data for these analyses
come from Glass et al. (1981, pp. 66-67). The results of 12 meta-analyses on
various topics are presented. For each meta-analysis, an effect size (Glass’s
Δ or Cohen’s d) was estimated from at least two different information sources.

Table 3.1 shows the six possible pairs of sources of information, the
number of meta-analyses providing effect size estimates for each pair of
sources, the reliability obtained between the two sources of each pair
(computed over all available areas meta-analyzed), and the p level of the
reliability obtained. The median (unweighted and weighted) reliabilities of .87 and
.85 and the weighted mean r of .83 show that there is a high degree of
reliability, on the average, between the various pairs of information sources.
There is little support here for the position that holds that some sources of
information may be misleading relative to the others. On the basis of the 12
meta-analyses available, we can conclude that meta-analyses finding larger
effect sizes from one source of information are also likely to find larger
effect sizes from other sources of information.

TABLE 3.1

Reliability of Information Sources
for a Sample of Meta-Analyses

                           Number (n) of    Reliability (r)   p Level of
Pairs of Sources           Meta-Analyses    of Sources        Reliability

Journal; thesis                 10              .89             .0005
Journal; unpublished             7              .65             .06
Thesis; unpublished              7              .85             .008
Book; journal                    6              .82             .025
Book; thesis                     4              .96             .02
Book; unpublished                3             1.00             .005

Median                           6.5            .87             .014
Weighted Median                  7              .85             .008
Weighted (n - 2) Mean                           .83
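The summary rows of Table 3.1 can be recomputed from its six entries. In the sketch below the helper names are our own, the reliabilities are averaged directly rather than through z_r (one of them is 1.00, for which z_r is undefined), and the use of n − 2 as the weight for the weighted median is an assumption, since the table states that weight only for the mean.

```python
from statistics import median

# (pair of sources, number of meta-analyses n, reliability r) from Table 3.1
TABLE_3_1 = [
    ("journal; thesis", 10, 0.89),
    ("journal; unpublished", 7, 0.65),
    ("thesis; unpublished", 7, 0.85),
    ("book; journal", 6, 0.82),
    ("book; thesis", 4, 0.96),
    ("book; unpublished", 3, 1.00),
]


def weighted_mean(values, weights):
    return sum(w * v for v, w in zip(values, weights)) / sum(weights)


def weighted_median(values, weights):
    """Smallest value at which the cumulative weight reaches half the total."""
    half = sum(weights) / 2
    cumulative = 0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= half:
            return value


rs = [r for _, _, r in TABLE_3_1]
ws = [n - 2 for _, n, _ in TABLE_3_1]   # weight each r by n - 2

print(round(median(rs), 2))
print(weighted_median(rs, ws))
print(round(weighted_mean(rs, ws), 2))
```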

I.A.2. Differences among information sources. High reliability of sources
of information does not necessarily mean that sources will agree in their
estimates of effect sizes for a given meta-analytic review. Two sources could
have perfect reliability (r = 1.00) yet differ greatly in the effect sizes
estimated, so long as the difference was constant for every meta-analysis. The
purpose of this section is to present the results of an analysis investigating 
systematic differences among information sources in the average effect size 
found. Our raw data again come from Glass et al. (1981, pp. 66-67). 

For each of the 12 meta-analyses, the mean effect sizes are reported for 
all those sources that provided relevant information. If there had been four 
sources of information available for each of the 12 meta-analyses, we would 
have been able to provide a simpler answer to our question by merely
examining the means or medians obtained from all four sources. Unfortunately,
only 3 of the 12 meta-analyses provided data from all four sources; 13 of the
possible 48 (12 × 4) estimates were not available. Under these conditions,
comparing the grand means or medians of each of the sources of
information confounds the source of information with the area being summarized.

As a rough guide to the likelihood that such confounding might be a 
problem, Table 3.2 was prepared. It shows the median effect size obtained 
from journal information for those meta-analyses in which the other sources 
of information were or were not available. Thus the median effect size of 
meta-analyses in which books were not available was .64 but the median 
effect size was only .44 when books were available. It appears, then, that we 
might erroneously conclude that books underestimate effect sizes relative 
to journals when actually it just happened that books were available as an 
information source for those areas of research showing smaller effect sizes 
anyway (as defined by journal information). 

Table 3.2 shows that the availability of unpublished material was
unrelated to effect sizes estimated from journal sources. This suggests no
problem of biased availability of studies as there had been for books as sources of
information. The result for thesis-based information showed a small
tendency for meta-analyses for which theses were available to be associated
with somewhat larger effect sizes (as defined by journal information). The
data base is not large enough to warrant firm conclusions, but the data are
suggestive enough that we should try to assess differences among
information sources correcting for sampling bias. This type of correction can be
achieved by considering each type of source pairwise with every other
source as shown in Table 3.3.

TABLE 3.2

Median Effect Sizes Obtained from Journal Information
for Meta-Analyses in Which Other Sources
Were or Were Not Available

                      Information Source
Source             Available     Unavailable     Difference

Book               .44 (6)^a     .64 (6)           -.20
Thesis             .51 (10)      .40 (2)            .11
Unpublished        .50 (7)       .49 (5)            .01

Mean                             .51               -.03

a. The number of meta-analyses on which the median is based is shown in parentheses.
b. The median of all 12 meta-analyses based on journal information alone was also .50.


TABLE 3.3

Pairwise Comparisons of Effect Sizes Obtained
from Four Information Sources

                          Number (n)
                          of Meta-     First       Second      Mean          Median
Pairs of Sources          Analyses     Mean Δ^a    Mean Δ^a    Difference    Difference

Journal; thesis              10          .56         .30         .26^b          .22
Journal; unpublished          7          .56         .64        -.08            .05
Thesis; unpublished           7          .31         .64        -.33           -.07
Book; journal                 6          .34         .42        -.08            .00
Book; thesis                  4          .40         .27         .13            .14
Book; unpublished             3          .31         .68        -.37           -.09

a. Note that our purpose is not to estimate the average effect size obtained from specific sources of
information since that depends most heavily on the areas meta-analyzed. Our purpose is to estimate
as well as we can the difference between average sizes obtained from various sources of
information.

b. This is the only significant mean difference, t(9) = 4.78, p = .001, two-tailed (p = .001 by sign test
as well).

The first and second columns of Table 3.3 show the two paired sources of
information and the number of meta-analyses upon which each pairwise
comparison is based. The third and fourth columns give the mean effect size
(Δ) found for the first and second named source, respectively. The fifth
column gives the difference between the means with the second subtracted
from the first. The final column, and perhaps the most important, gives the
median of the n difference scores for each set of matched pairs. The only
significant mean difference shows larger effect sizes obtained from journals
than from theses (excess of Δ = .26 and .22 for mean and median
differences, respectively). These results support the conclusion drawn by Glass et
al. (1981) though the present pairwise analysis, controlling for the
confounding of topic and source, shows a difference larger by between 37% and
63% than that reported by Glass et al.

There is certainly no clear difference between mean effect sizes obtained
from journals compared to unpublished materials. The mean difference
favors one by .08 Δ units; the median difference favors the other by .05 Δ
units. The results of this analysis very strongly suggest that the burden of
proof now rests on those who claim that unpublished (not unretrieved but
retrievable unpublished) studies are biased in their results relative to
published studies.

On the average, theses obtain smaller effect sizes than do unpublished 
studies but the difference shrinks dramatically when the median difference 
is employed rather than the mean difference. 

Books and journals tend to obtain very similar effect sizes but books do 
tend to obtain somewhat higher effect sizes than do theses. Finally, books 
tend to obtain smaller effect sizes than do unpublished papers but the
median difference is not large and, given the small n (3), even the large mean
difference of .37 could be a sampling fluctuation (p = .45). 

There is no simple way to summarize the data of Table 3.3. One
provisional method that preserves the pairwise nature of the comparisons is to
consider in turn all pairwise comparisons of each of the four sources with all
others and report the median of all these comparisons. Table 3.4 shows that
journal articles, unpublished manuscripts, and books are essentially
indistinguishable from each other. However, theses tend to yield noticeably
smaller effect sizes than do the other three sources of information:
smaller, that is, by about 1/5 of a standard deviation.
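The provisional summary just described can be sketched from the entries of Table 3.3. For each source we collect its three pairwise differences (signed so that the source in question comes first) and take their median. The data structure and function name are our own, and the output is offered only as an illustration of the method; it is not claimed to reproduce the entries of Table 3.4.

```python
from statistics import median

# (first source, second source): (mean difference, median difference),
# each taken as first minus second, from Table 3.3.
TABLE_3_3 = {
    ("journal", "thesis"): (0.26, 0.22),
    ("journal", "unpublished"): (-0.08, 0.05),
    ("thesis", "unpublished"): (-0.33, -0.07),
    ("book", "journal"): (-0.08, 0.00),
    ("book", "thesis"): (0.13, 0.14),
    ("book", "unpublished"): (-0.37, -0.09),
}

MEAN_DIFF, MEDIAN_DIFF = 0, 1   # column positions within each tuple


def source_vs_others(source, column):
    """Median of a source's three pairwise differences (source minus other)."""
    diffs = []
    for (first, second), values in TABLE_3_3.items():
        if source == first:
            diffs.append(values[column])
        elif source == second:
            diffs.append(-values[column])   # flip sign so the source comes first
    return median(diffs)


for source in ("journal", "thesis", "unpublished", "book"):
    print(
        source,
        round(source_vs_others(source, MEAN_DIFF), 2),
        round(source_vs_others(source, MEDIAN_DIFF), 2),
    )
```

Whichever column is used, theses come out below the other sources, which is the pattern described in the text.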

How might we explain this bias for theses to yield smaller effect sizes? One
analysis that may be instructive was carried out as part of a meta-analysis by
Rosenthal and Rubin (1978). In their study of 345 studies of interpersonal
expectancy effects, they computed separate analyses for dissertation and
nondissertation studies and found that dissertations did indeed yield substantially
smaller effects on the average. Each of the 345 studies in their sample was also
classified by whether the investigator(s) had taken special pains to control for
errors of recording or cheating by the experimenters or teachers being
studied.


TABLE 3.4

Pairwise Comparisons of Effect Sizes Obtained
From Each Source Against All Others


TABLE 3.5

Mean Effect Sizes (d) for Dissertation and Nondissertation Studies
Employing or Not Employing Special Control Procedures

                                                              Unweighted   Weighted
                         Dissertations    Nondissertations    Mean         Mean

Special controls          .78 (18)^a        .54 (25)            .66          .64
No special controls      -.09 (14)          .75 (288)           .33          .71

Unweighted mean           .345              .645                .495
Weighted mean             .40               .73                              .70

a. The number of studies on which the mean is based is shown in parentheses.

Table 3.5 shows the mean effect sizes obtained in dissertations and
nondissertations that either had or had not instituted special controls for
intentional or unintentional errors. Most of the variation (93%) among the four
estimated mean effect sizes was due to the difference between dissertations
employing no special controls and the remaining three groups of studies
which differed relatively little among themselves. These results suggest the
possibility that the tendency for theses to yield smaller effect sizes than
other sources of information may be due primarily to the less carefully
executed of the theses.
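The marginal means of Table 3.5 and the 93% figure can be recomputed from the four cells. In this sketch the cell for dissertations without special controls is taken to be −.09 based on 14 studies and the cell for nondissertations without special controls to be .75 based on 288 studies; these are the values implied by the table's marginal means and by the total of 345 studies. The 93% is obtained here as the sum of squares of a contrast with weights of −3 for dissertations without special controls and +1 for each of the other three cells, divided by the total sum of squares among the four unweighted means; that reading of the computation is our assumption.

```python
# (control procedure, type of study, mean d, number of studies) from Table 3.5
CELLS = [
    ("special controls", "dissertation", 0.78, 18),
    ("special controls", "nondissertation", 0.54, 25),
    ("no special controls", "dissertation", -0.09, 14),
    ("no special controls", "nondissertation", 0.75, 288),
]


def weighted_mean(rows):
    """Mean d weighted by the number of studies in each cell."""
    return sum(d * k for _, _, d, k in rows) / sum(k for _, _, _, k in rows)


dissertations = [row for row in CELLS if row[1] == "dissertation"]
nondissertations = [row for row in CELLS if row[1] == "nondissertation"]

means = [d for _, _, d, _ in CELLS]
grand_mean = sum(means) / len(means)
ss_among_means = sum((m - grand_mean) ** 2 for m in means)

# Contrast: dissertations without special controls versus the other three cells.
contrast = [1, 1, -3, 1]
contrast_value = sum(c * m for c, m in zip(contrast, means))
ss_contrast = contrast_value ** 2 / sum(c * c for c in contrast)
proportion = ss_contrast / ss_among_means

print(round(weighted_mean(dissertations), 2), round(weighted_mean(nondissertations), 2))
print(round(weighted_mean(CELLS), 2), round(grand_mean, 3))
print(round(proportion, 2))
```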

Before leaving the comparison of theses with other information sources,
we should note that theses were conspicuously over-represented among the
studies instituting special controls for intentional or unintentional errors.
The correlation between being a thesis and employing special controls was
.42 (χ²(1) = 62.0, N = 345, p < .0001). Of the nondissertations, only 8%
employed special controls; of the dissertations, however, 56% did so. The
typical dissertation, then, may be more carefully done than the typical
nondissertation. Perhaps this is due to the healthy monitoring that is often
carried out by a conscientious dissertation committee.
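The correlation of .42 and its χ² can be recovered from the 2 × 2 table of counts implied by Table 3.5. The count of 14 dissertations without special controls is inferred from the total of 345 studies and is therefore an assumption of this sketch; the correlation is computed as the phi coefficient, and χ²(1) as N times phi squared.

```python
import math

# Counts of studies: rows are type of study, columns are control procedure.
diss_controls, diss_no_controls = 18, 14          # dissertations
nondiss_controls, nondiss_no_controls = 25, 288   # nondissertations

n_total = diss_controls + diss_no_controls + nondiss_controls + nondiss_no_controls

# Phi coefficient for a 2 x 2 table: (ad - bc) / sqrt(product of the four margins).
numerator = diss_controls * nondiss_no_controls - diss_no_controls * nondiss_controls
margins = (
    (diss_controls + diss_no_controls)
    * (nondiss_controls + nondiss_no_controls)
    * (diss_controls + nondiss_controls)
    * (diss_no_controls + nondiss_no_controls)
)
phi = numerator / math.sqrt(margins)
chi_square = n_total * phi ** 2   # chi-square with 1 df

pct_diss = 100 * diss_controls / (diss_controls + diss_no_controls)
pct_nondiss = 100 * nondiss_controls / (nondiss_controls + nondiss_no_controls)

print(n_total, round(phi, 2), round(chi_square, 1))
print(round(pct_diss), round(pct_nondiss))
```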

I.B. Abstracting Research Results 

Once we have located the studies to include in our meta-analysis, we 
must decide what information is to be abstracted from each document. We 
know from the last chapter that we will always want to record both the 
significance level and the size of the effect and that, if one of these is not 
provided, we can estimate it if we know the size of the study. But what else 
are we to record for each study? The answer, of course, depends on the
specific goals of our meta-analysis. It is easiest to begin with examples of
useful formats for abstracting information from studies. 

I.B.1. Interpersonal expectancy effects. Since the early 1960s the
present writer has been conducting meta-analyses of studies of the effects of
experimenters’ (or teachers’ or clinicians’) expectations on the response
obtained from their subjects (or pupils, or patients). For each of the studies
retrieved, the following information was typically recorded:

(1) Complete reference, as for a bibliography. 

(2) Authors’ full names and addresses, so that they could be contacted 
for further information about the study in question, about their 
work in progress, and about the work of others in the same general 
area. 

(3) Sex of data collector, since the sex of the data collector may be related 
to the results obtained. 

(4) Status of data collector, e.g., faculty member, doctoral candidate, 
graduate student, undergraduate, and so on, since it has been found 
that the status of the data collector may affect the results obtained. 

(5) Relationship of data collector to meta-analyst, so that correlations 
could be computed between the results obtained and the degree of 
acquaintanceship with the meta-analyst (Rosenthal, 1969). 

(6) Sex of subjects, number of each sex who served as subjects, pupils, 
or patients. 

(7) Nature of subject sample, i.e., where and how obtained. 

(8) Sex of experimenters, number of each sex who served as experi¬ 
menters, teachers, or clinicians. 

(9) Nature of experimenter sample, i.e., where and how obtained. 

(10) Relative status of experimenters, since smaller expectancy effects 
are obtained when there is little status differential favoring the ex¬ 
perimenter. 

(11) Task, test, or other behavior of subjects constituting the dependent 
variable. 

(12) Unusual design features, e.g., not an experiment but causal infer¬ 
ence strengthened by use of such procedures as cross-lagged panel 
analysis, analysis of covariance, partial correlations, path analysis, 
and so forth. 

(13) Additional control groups, as when high and low induced-expect¬ 
ancy conditions can be compared to a randomly assigned group of 
no-induced-expectancy subjects. 

(14) Procedural controls for cheating and/or observer error, as when all 
interactions are filmed, videotaped, or otherwise monitored, and/ 
or when experimenters’ recordings can be otherwise checked. 

(15) Moderating variables, variables associated with differences in obtained 
results, the direction of their effect, effect size, and significance 
level. 

(16) Mediation data, any results bearing on the processes by which ex¬ 
perimenters, teachers, or clinicians may have communicated their 
expectations to their subjects, pupils, or patients. 

(17) Expectancy effect, effect size (including direction) and significance 
level associated with the effects of the experimenters’, teachers’, 
or clinicians’ expectancies. 

I.B.2. Psychotherapy outcome. A more detailed type of abstracting was 
employed by Glass and his colleagues (e.g., Glass et al., 1981, pp. 80-91, 233-237) 
in their seminal meta-analysis of psychotherapy outcome experiments. 
They divided their coding into the methodological and substantive features 
that are briefly summarized here: 

I.B.2.a. Methodological features. These included (1) date of publication, 
(2) type of publication, (3) degree to which experimenter was blind, (4) how 
clients were obtained, (5) how clients were assigned to conditions, (6) client 
loss for each condition, (7) internal validity, (8) experimenter’s probable 
preference for outcomes, and (9) reactivity of outcome measure. 

I.B.2.b. Substantive features. These included: (10) professional field of 
experimenter, (11) similarity of client to therapist, (12) diagnosis of client, 
(13) duration of previous hospitalization, (14) intelligence of typical client, 
(15) mode of therapy (e.g., individual, group), (16) site of therapy, (17) du¬ 
ration of therapy, (18) therapist experience or status level, (19) outcome 
measures, (20) type of psychotherapy, (21) degree of confidence in deciding 
type of therapy, and (22) effect size. 

I.B.3. Ethnic group and social class differences in need for achievement. 
A very focused type of abstracting was employed by Harris Cooper (1984) 
in his comparison of ethnic groups and social class levels on need for 
achievement. A summary of his coding sheet follows: (1) complete citation; 
(2) source of reference; (3) sex of subjects, with n of each, for the two groups 
being compared; (4) average age of subjects in each group; (5) geographic 
location of each group; (6) other restrictions pertaining to each group; (7) 
ethnicity of each group; (8) mean and S of each ethnic group on need for 
achievement; (9) type of significance test and df error employed; (10) value 
of test statistic obtained, and df effect; (11) p level and effect size obtained; 
(12) direction of results; (13) social class of each group; (14) standardized vs. 
informal measure of social class for each group; (15) basis of classification 
for each group including occupation, salary, social status, or other; (16) 
mean and S of each social class group on need for achievement; (17-20) 
items 9-12 (above) repeated for social class comparison; (21) dependent 
measures including TAT (n-Ach), French’s Test of Insight, California Psychological 
Inventory, or other; and (22) variables interacting with ethnicity 
or social class. 

I.B.4. Constructing a format for abstracting research results. Examination 
of the preceding three examples of abstracting formats will be useful in 
the construction of a new format. Using these examples and a process of 
free association, the beginning meta-analyst can construct a preliminary 
form. This form should then be discussed with colleagues and advisors who 
can suggest other variables to be included. Finally, a revised form might be 
sent to workers in the area of the meta-analysis, with an invitation to have 
them suggest other variables that should be coded. 
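
One way to make a preliminary abstracting form concrete is to write it down as a record type with one field per coded variable. The sketch below is only an illustration: the class name `StudyRecord` and every field name are assumptions chosen for this example, not a form used in any of the three meta-analyses described above.

```python
from dataclasses import dataclass, field, asdict
from typing import Optional


@dataclass
class StudyRecord:
    """One row of a hypothetical abstracting form (all field names are illustrative)."""
    reference: str                           # complete citation
    n_subjects: Optional[int] = None         # size of the study
    effect_size_r: Optional[float] = None    # effect size, including direction
    p_value: Optional[float] = None          # significance level
    special_controls: Optional[bool] = None  # controls for cheating or observer error
    moderators: dict = field(default_factory=dict)  # variable name -> note

    def missing_fields(self):
        """List the coded variables that still have to be abstracted or estimated."""
        return [name for name, value in asdict(self).items() if value is None]


record = StudyRecord(reference="Hypothetical study A", n_subjects=40, effect_size_r=0.30)
print(record.missing_fields())  # ['p_value', 'special_controls']
```

Adding a variable suggested by a colleague then amounts to adding one field, and `missing_fields` shows which entries still need to be located or estimated for a given study.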

I.C. The Reliability of Retrieval 

I.C.1. Reliability of locating research results. It would be useful to know 
the reliability of locating research results. If two meta-analysts set out to 
retrieve the relevant research results for the same research question, how 
closely would their acquisitions agree? No empirical answer to that question 
is available. We do know that if each meta-analyst employed only one (or 
two) research indices, each would miss an appreciable proportion of retriev¬ 
able studies (Glass et al., 1981, pp. 63-65). A thorough retrieval effort, how¬ 
ever, would involve going well beyond one or two research indices (M. Ro¬ 
senthal, 1985). 

It is not even obvious how one would determine the correlation defining 
reliability in the situation of our two meta-analysts. Would we set up a 2 × 2 
table with columns representing the first meta-analyst’s choices (i.e., in¬ 
cluded vs. not included in the analysis), and with rows representing the 
second meta-analyst’s choices (i.e., included vs. not included in the analy¬ 
sis)? What would be the entry for the cell included in neither analysis? 
Would it be the hundreds of thousands of studies not relevant to the analy¬ 
sis? Whatever the problem of computing (or even defining) the reliability of 
locating research results, the problems are identical whether the research 
summarizing process is to be traditional or meta-analytic. 

I.C.2. Reliability of coding study features. When we try to estimate the 
reliability of the coding of studies after they have been retrieved, we can do 
quite a bit better. Several studies have reported proportions of agreement on 
specific items as coded by two judges. 

Table 3.6 presents a summary of two of these studies. Study 1 is by 
Stock, Okun, Haring, Miller, Kinney, and Ceurvorst (1982), and Study 2 is 
by Jackson (1978). For each study sample items are given to illustrate items 
associated with varying proportions of agreement from below .50 to 1.00. 

TABLE 3.6 

Examples of Items Obtaining Various Proportions of Agreement 

Proportion of agreement (levels shown): 1.00, .96-.99, .92-.95, .88-.91, .84-.87, .80-.83 

Study 1 sample items: Median age; Mean age; Age range; Total N; Was median 
age reported?; Type of bivariate relationship; Type of sampling procedure; 
Total number of subsamples 

Study 2 sample items: Name of periodical; Research index used?; Own studies 
cited?; Secondary analyses done?; Specific recommendations?; Relationship 
exists?; Critique of prior reviews?; Percentage of studies cited that are not 
directly relevant; Percentage of studies examining interaction effect; Are 
surveys among the major approaches employed? 

                             Study 1      Study 2 
Number of items                 25           65 
Range                        .57-1.00     .44-1.00 
25th to 75th percentile      .86-.965     .71-.87 
Median                         .92          .79 


Median proportions of agreement are very substantial but we could have 
had more confidence in the interpretation of these data had the correlations 
been provided rather than the proportion of agreements. It is possible to 
have near perfect agreement and still have reliability coefficients of only 
about .50. More detailed discussion of this problem is available in Lewin 
and Wakefield (1979), Rosenthal (1982a, 1987a), Wakefield (1980), and in 
our subsequent discussion of effective reliability, especially the final paragraph 
on product moment correlations (section II.C.1.a of this chapter). 

I.C.3. Reliability of significance level and effect size estimates. Perhaps 
the two things we want to code most reliably are the results themselves: 
results defined as significance levels and effect size estimates. Unfortu¬ 
nately there appear to be no reliability data on the estimation of significance 
levels as such. We come close in the raw data of the study by Cooper and 
Rosenthal (1980) described earlier. As part of that study, 19 meta-analysts 
were asked to decide whether a given set of seven studies supported the 
rejection of the null hypothesis. From the analysis of variance of the 19 × 7 
data matrix, we were able to compute the intraclass correlation which is 
analogous to the average interjudge reliability. For the 19 meta-analysts, this 
correlation was .969, a very high degree of reliability. In a situation where 
accurate p levels were being estimated (rather than accept-reject decisions) 
and meta-analysts might not even have retrieved the same studies, and different 
procedures for combining p levels might have been used, the reliability 
would surely be lower. 

When we turn to the estimation of effect sizes, relevant data are provided 
by Glass et al. (1981). They had six studies for each of which two judges 
computed Glass's Δ as the effect size estimate. The mean absolute difference 
between the pairs of judges was only .07 standard deviation units (Δ’s). 
The mean algebraic difference was even smaller—.01 standard deviation 
units! Reliability, however, is indexed by a correlation coefficient rather 
than a mean difference and it is possible in principle to have a very small 
mean difference (i.e., excellent agreement in mean judgments) yet have a 
low reliability. That did not occur in this case. For the set of six studies and 
two judgments per study, the intraclass correlation was .993. 

Should future studies yield lower reliabilities, Glass et al. would not be 
surprised nor would I. As the former authors point out, although the definition 
of Δ (and of other effect size estimates) is simple, in actual practice 
judgments must be made, assumptions must be made, and series of calcula¬ 
tions must be made and all of these may be made somewhat differently by 
equally well-trained and experienced meta-analysts. 


II. ASSESSING RESEARCH RESULTS 

II.A. Correcting Research Results 

The purpose of this section is to emphasize that error-making is normal. 
The meta-analyst will make mistakes and the authors of the studies summa¬ 
rized will have made mistakes. Careful reading of the original papers by the 
meta-analyst will often reveal errors. Fortunately, these errors can often be 
corrected before the meta-analytic procedures are applied (Rosenthal & 
Rubin, 1978). 

One type of error that is difficult for the meta-analyst to correct or even to 
diagnose is an error of recording the data as the data were being obtained by 
the original investigator. How often do recording errors occur? When they do 
occur, are they likely to favor the investigator’s research hypothesis? Building 
on some earlier work on this topic (Rosenthal, 1978b), I was able to collect 27 
studies for this book that yielded some up-to-date information on these two 
questions. Since most of these studies were designed at least in part to permit 
the quantitative assessment of error rates, they cannot be regarded as representative 
of behavioral research in general. We have no way of knowing, however, 
whether these studies are likely to yield overestimates or underestimates 
of rates of error-making. The 27 studies ranged widely in terms of research 
area and locus of data collection, e.g., studies of reaction time, person perception, 
learning (human and animal), task ability, psychophysical judgments, 
questionnaire responses, classroom behavior, and mental telepathy. In addi¬ 
tion to behavioral research, legal research and health research were repre¬ 
sented. Although there were not enough studies in these various categories to 
permit sensitive comparisons, there appeared to be no clear relationship be¬ 
tween area of research and either the rate of recording errors or the likelihood 
of errors being biased when they did occur. 

In most of the studies, errors were defined only in terms of misrecording 
a response that was either seen or heard by the data recorder. In a few cases, 
however, simple arithmetic was also required by the recorder so that ob¬ 
server errors could not be distinguished from arithmetic errors. In these 
cases, however, the results were so close to the results of studies of simple 
recording errors that they could safely be grouped together, at least for our 
present purpose. 

It is also important that for almost all the 27 studies located, the observers 
had finished their task to their satisfaction and did not know that their obser¬ 
vations would be checked for errors. Thus whatever checking could be done 
or was going to be done by the observers had been done at the time of the 
analysis of errors. It is unlikely, therefore, that the estimates of error were 
inflated due to the observers’ not having finished their checking operations. 

Not all of the studies provided the data in directly usable form and it was 
necessary to make some estimates from data provided. For example, an 
investigator might mention in passing that 10 responses were misrecorded 
by the observers but not how many observations were recorded altogether. 
A reasonable estimate of this total was often available, however, as when the 
investigator reported that 5 observers each collected data from 10 participants 
each of whom made 20 responses (e.g., 5 × 10 × 20 = 1,000). 

Table 3.7 shows, for each of the 27 studies, the number of observers 
involved, the number of recordings made, the number of errors committed, 
the percentage of all recordings that were wrong, and the percentage of the 
errors committed that favored the hypothesis of the observer. Tables 3.8 
and 3.9 present stem-and-leaf plots and robust summary statistics of the 
percentage of observations that were in error and the percentage of errors 
that favored the observers’ hypotheses (Rosenthal & Rosnow, 1975; Tukey, 
1977). Tukey (1977) developed the stem-and-leaf plot as a special form of 
frequency distribution to facilitate the inspection of a batch of data. Each 
number in the data batch is made up of one stem and one leaf, but each stem 
may serve several leaves. Thus, the seventh stem under recording error, a 
1., is followed by two leaves of 59 and 69, representing the numbers 1.59 
and 1.69. The first digit is the stem, the next digit(s) the leaf. The eye takes 
in a stem-and-leaf plot as it does any other frequency distribution, but the 
original data are preserved with greater precision in a stem-and-leaf plot 
than would be the case with ordinary frequency distributions. 
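
The construction Tukey describes can be sketched in a few lines. The example below uses the 14 bias percentages reported in Table 3.7 and takes the tens digit as the stem and the units digit as the leaf; the function name `stem_and_leaf` is an assumption made for this illustration.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group two-digit values by tens digit (stem); units digits are the leaves."""
    plot = defaultdict(list)
    for v in sorted(values):
        plot[v // 10].append(v % 10)
    return dict(plot)


# Bias percentages of the 14 studies in Table 3.7 that report them.
bias = [68, 75, 85, 67, 60, 50, 33, 19, 62, 58, 74, 36, 91, 65]
plot = stem_and_leaf(bias)

# Print from the highest stem down, keeping empty stems so gaps stay visible.
for stem in range(max(plot), min(plot) - 1, -1):
    leaves = " ".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem} | {leaves}")
```

Keeping the empty stems (here 4 and 2) is what lets the plot be read as a frequency distribution while every original value remains recoverable.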

TABLE 3.7 

Recording Errors in 27 Studies 

                                 Observers    Recordings      Errors        Error       Bias 
Study                            (N = 711)  (N = 219,296)  (N = 23,605)  Percentage  Percentage 

 (1) Kennedy & Uphoff, 1939          28        11,125          126          1.13         68 
 (2) Rosenthal et al., 1964          30         3,000           20          0.67         75 
 (3) Weiss, 1967                     34         1,770           30          1.69         85 
 (4) Persinger et al., 1968          11           828            6          0.72         67 
 (5) Jacob, 1969                     36         1,260           40          3.17         60 
 (6) Todd, 1971                       6           864            2          0.23         50 
 (7) Glass, 1971                      4            96            4          4.17         33 
 (8) Hawthorne, 1972                 18         1,009           16          1.59         19 
 (9) McConnell, 1955                393        18,000            0          0.00         — 
(10) Rosenthal & Hall, 1968           5         5,012           41          0.82         — 
(11) Doctor, 1968                    15         9,600           39          0.41         — 
(12) Compton, 1970                    9         3,794           36          0.95         — 
(13) Howland, 1970                    9           360            9          2.50         — 
(14) Mayo, 1972                      15           688            0          0.00         — 
(15) Eisner et al., 1974             12         9,600           66          0.69         — 
(16) Rusch et al., 1978               2        46,079       22,339         48.48         — 
(17) Marvell, 1979                    2         2,156           52          2.41         — 
(18) Fleming & Anttonen, 1971         —        89,980          558          0.62         — 
(19) Goldberg, 1978                   —         5,600           40          0.71         — 
(20) Tobias, 1979                     —         4,221          141          3.34         — 
(21) Tobias, 1979                     —         4,254           40          0.94         — 
(22) Johnson & Adair, 1970           12             —            —             —         62 
(23) Johnson & Adair, 1972           12             —            —             —         58 
(24) Ennis, 1974                     42             —            —             —         74 
(25) Rusch et al., 1974               2             —            —             —         36 
(26) Johnson & Ryan, 1976             6             —            —             —         91 
(27) Johnson & Ryan, 1976             8             —            —             —         65 

Median                               12         3,794           39          .94 a        64 b 

a. Median weighted by number of recordings = .62. 

b. Median weighted by number of recordings = 68. 


From Tables 3.7, 3.8, and 3.9 we note that the typical rate of making 
recording errors is about 1% but that, in an occasional study, the error rate 
can climb to an extraordinary level of over 48%. Normally we might expect 
an error rate that high when data are recorded from an analogue rather than 
a digital mechanism. Thus reading an analogue thermometer as 98.6 could 
be interpreted as wrong when a digital read-out tells us that the “true” tem¬ 
perature is 98.63. Extraordinary error rates may have more to say about an 
overly precise criterion than about any practical problem of measurement. 
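
The "typical rate" can be checked directly from the recordings and errors columns of Table 3.7. The sketch below recomputes each error percentage, the unweighted median, and a median weighted by number of recordings. The definition of the weighted median used here (the smallest rate at which the cumulative number of recordings reaches half the total) is an assumption of this sketch; the text does not state how its weighted median was obtained.

```python
from statistics import median

# (recordings, errors) for the 21 studies in Table 3.7 that report both.
studies = [
    (11125, 126), (3000, 20), (1770, 30), (828, 6), (1260, 40), (864, 2),
    (96, 4), (1009, 16), (18000, 0), (5012, 41), (9600, 39), (3794, 36),
    (360, 9), (688, 0), (9600, 66), (46079, 22339), (2156, 52),
    (89980, 558), (5600, 40), (4221, 141), (4254, 40),
]


def weighted_median(values, weights):
    """Smallest value whose cumulative weight reaches half of the total weight."""
    half = sum(weights) / 2
    cumulative = 0
    for value, weight in sorted(zip(values, weights)):
        cumulative += weight
        if cumulative >= half:
            return value


recordings = [r for r, _ in studies]
rates = [100 * e / r for r, e in studies]  # error percentage per study

print(round(median(rates), 2))                       # 0.94
print(round(weighted_median(rates, recordings), 2))  # 0.62
```

Both medians sit far below the pooled rate (23,605 errors in 219,296 recordings, nearly 11%), which is dominated by the single study with an error rate above 48%. That is why a robust summary is used to describe the typical study.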

These same three tables also suggest that, of the observational errors 
that are made, about two-thirds support the observer’s hypothesis when 
only half should do so if the observers were unbiased. (When each study 
was weighted by the number of errors made, the overall test that bias was 
nonzero yielded Z = 4.88, p < .000001.) 

TABLE 3.8 

Stem-and-Leaf Plots of Recording Error and Bias Rates 
(in percentages) 

     Recording Error (b)                      Bias 
Stem   Leaf                            Stem   Leaf 

 48.   48 (a)                           9     1 
  .                                     8     5 
  4.   17                               7     4 5 
  3.   17 34                            6     0 2 5 7 8 
  2.   50                               5     0 8 
  2.   41                               4 
  1.   59 69                            3     3 6 
  1.   13                               2 
  0.   62 67 69 71 72 82 94 95          1     9 
  0.   00 00 23 41 

a. Stems between 4 and 48 are omitted to save space. 

b. Stems were divided into upper and lower halves to spread out the distribution. 


TABLE 3.9 

Summary Statistics for Recording Error and Bias 



a. This value is a marked outlier; i.e., it deviates from the rest of the distribution of recording error 
rates at p much less than .001. The mean of the distribution dropping the highest and lowest scores is 
1.41. 


No one should be surprised to learn that data are sometimes wrongly 
recorded. Now, however, we have some idea of how often these errors oc¬ 
cur. The typical rate of 1% errors is low enough that, even if the errors were 
undetected, the conclusions of our studies would not be greatly affected. In 
several studies, analyzing the data with and without the errors corrected 
made no difference although biased errors would occasionally push a result 
over the magic .05 cliff (Nelson, Rosenthal, & Rosnow, 1986; Rosenthal & 
Gaito, 1963, 1964). Investigators emphasizing confidence intervals, effect 
sizes, and obtained levels of p will be less misled by the presence of some 
typical degree of error in their data than will investigators following a strict 
null hypothesis decision procedure (Snedecor & Cochran, 1967, page 28). 

Several implications flow from the results of Tables 3.7, 3.8, and 3.9. 
We should continue to keep track of error rates and the size of observer bias 
and do what we can to reduce our errors. Getting all the errors out is proba¬ 
bly not possible or even desirable from a cost/benefit perspective. It costs 
something to reduce errors, and it probably costs more and more to get rid 
of each error as there are fewer of them left. We may not feel it to be wise to 
give up half our research to be able to pay for bringing our accuracy rate 
from 99.0% to 99.9%, if that should be the price. 

Finally, there is something we can do to keep our errors random with 
respect to our hypotheses, so that they will not increase Type I errors. We 
can keep the processes of data collection and analysis as blind as possible for 
as long as possible. 

Of course it is no basis for rejoicing to learn that errors may be nearly 
universal even if they are not typically very damaging in their magnitude. 
Yet one desirable consequence of widespread awareness of error might be 
to generate a more task-oriented attitude toward error than is currently 
widely shared. Too often the current attitude is that poor scientists (they) 
make errors; good scientists (we) don’t make errors. Given this attitude, 
when we reanalyze others’ data, we may wax indignant or even triumphant 
when we find errors. Our goal, it should be remembered, is not to show 
someone’s answer wrong; our goal is to get the answer right. Perhaps if we 
held this more task-oriented attitude, investigators would be more willing to 
let others examine their data. Then, perhaps, there would be a drop in the 
frequency of raw-data-consuming fires, a frequency that exceeds the limits 
of credibility (Wolins, 1962). 

II.B. Evaluating the Quality of Research Results 

In our earlier discussion of the reliability of retrieval we examined evi¬ 
dence relevant to the accuracy of coding of various features of the studies 
retrieved. High rates of coder agreement were found for such variables as 
subjects’ average age and the periodical in which the study appeared. Lower 
agreement was found for features requiring a greater degree of personal 
judgment. In this section we continue the discussion of reliability but with 
an emphasis now not on what the study did, but how well the study investi¬ 
gated its topic. 

One of the major criticisms of meta-analyses is that poor studies are sum¬ 
marized as well as good studies. Wise meta-analysts make it their business to 
locate all the studies, poor as well as good. Wise traditional reviewers do the 
same. Once all the retrievable studies have been found, decisions can be made 
about the use of any study. Precisely the same decision must be made about 
every study retrieved: How shall this study be weighted? Dropping a study is 
simply assigning a weight of zero. If there is a dimension of quality of study 
(e.g., internal validity, external validity, and so on) then there can be a corresponding 
system of weighting. If we think a study is twice as good as another, 
we can weight it twice as heavily or four times more heavily, and so forth. 
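
The idea that dropping a study is the same as giving it a weight of zero can be shown with a quality-weighted mean. This is a minimal sketch with made-up effect sizes and weights; the function name `weighted_mean_effect` is an assumption of this example, and it averages the effect sizes directly rather than applying any transformation a particular effect size index might call for.

```python
def weighted_mean_effect(effects, weights):
    """Quality-weighted mean of effect sizes; a weight of 0 drops the study."""
    total = sum(weights)
    if total == 0:
        raise ValueError("at least one study needs a nonzero weight")
    return sum(e * w for e, w in zip(effects, weights)) / total


# Hypothetical effect sizes for three studies.
effects = [0.10, 0.30, 0.50]

# The second study is judged twice as good as the first; the third is dropped.
weights = [1, 2, 0]

print(round(weighted_mean_effect(effects, weights), 3))  # 0.233
print(round(weighted_mean_effect(effects[:2], [1, 2]), 3))  # same value
```

Because the weights are explicit, the same effect sizes can be summarized again under a different weighting scheme, which makes the influence of the quality judgments easy to inspect.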

There is a danger, however, in assigning quality weights to studies, i.e., 
that we will assign high weights to studies whose results we favor and low 
weights to those we do not favor (Glass, 1976). The ideal solution is to have 
each study coded by several excellent methodologists who have no special 
investment in the area being investigated. Their quality assessments would 
be made twice; once based only on their reading of the methods section and 
once based on their reading of the methods plus results section. The reason 
for the first rating is to ensure that at least one judgment of quality is made 
before the judge has learned the results of the study. One should be able to 
assess at least the design features relevant to both internal and external 
validity before reading the results. 

The specific judgments to ask of our methodologists can range from the 
most general question of overall quality rated on a 9-point rating scale, to 
intermediate level questions of quality of design, quality of statistical analy¬ 
sis, quality of ecological validity, and the like, all rated on a 9-point scale, to 
a series of very specific questions such as: Was random assignment of sub¬ 
jects employed? Was the assumption of independence of errors in the analy¬ 
sis of variance met? Whether highly specific variables are judged or not, in 
the end one overall variable (or a smallish number of fairly general varia¬ 
bles) relevant to quality will be constructed and will be correlated with size 
of the effect obtained (Glass et al., 1981; Rosenthal & Rubin, 1978). 

Glass et al. (1981) have presented convincing evidence that, in the typi¬ 
cal meta-analysis, there is no strong relation between the quality of the study 
and the average size of the effect obtained. Nevertheless, whether such a 
relation exists should be assessed specifically for each question being ad¬ 
dressed meta-analytically. Once our methodologists have assessed each 
study for quality, we must assess the assessors. The assessment is made 
empirically by determining their reliability. We do not expect reliability co¬ 
efficients to be extremely high for complex judgments of research quality 
(Fiske, 1983). Nevertheless we need to know the reliability for several rea¬ 
sons. Perhaps the main reason is that knowing the reliability suggests 
whether we will need to increase our sample of judges of research quality. 

II.C. The Reliability of Judgments of Quality 

II.C.1. Effective reliability. Suppose we had available two judges of the 
quality of the studies in our meta-analysis. The correlation coefficient re¬ 
flecting the reliability of the two judges’ ratings would be computed to give 
us our best (and only) estimate of the correlation likely to be obtained between 
any two judges drawn from the same population of judges. This correlation 
coefficient, then, is clearly useful; it is not, however, a very good 
estimate of the reliability of our variable, which is not the rating of quality 
made by a single judge but rather the mean of two judges’ ratings. Suppose, 
for example, that the correlation between our two judges’ ratings of quality 
were .50; the reliability of the mean of the two judges’ ratings, the “effec¬ 
tive” reliability, would then be .67 not .50. Intuition suggests that we should 
gain in reliability in adding the ratings of a second judge because the second 
judge’s random errors should tend to cancel the first judge’s random errors. 
Intuition suggests further that adding more judges, all of whom agree with 
one another to about the same degree, defined by a mean inter-judge correlation 
coefficient of .50 for this example, should further increase our “effective” 
reliability. Our intuition would be supported by an old and well-known 
result reported independently by Charles Spearman and William Brown in 
1910 (Walker & Lev, 1953). With notation altered to suit our current pur¬ 
pose, the well-known Spearman-Brown result is: 


\[
R = \frac{nr}{1 + (n - 1)r}
\]

where R = “effective” reliability 
      n = number of judges 
      r = mean reliability among all n judges (i.e., mean of n(n - 1)/2 
          correlations). 


Use of this formula assumes that a comparable group of judges would 
show comparable mean reliability among themselves and with the actual 
group of judges available to us. This assumption is virtually the same as that 
all pairs of judges show essentially the same degree of reliability. 
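
The Spearman-Brown result is easy to evaluate directly. The sketch below is an illustration only; the function name `effective_reliability` is an assumption of this example. It reproduces the value of .67 given above for two judges with a mean reliability of .50 and shows how R grows as judges are added.

```python
def effective_reliability(n, r):
    """Spearman-Brown: reliability of the mean of n judges with mean inter-judge r."""
    return n * r / (1 + (n - 1) * r)


# Effective reliability for a mean inter-judge correlation of .50.
for n in (1, 2, 4, 10):
    print(n, round(effective_reliability(n, 0.50), 2))
# 1 0.5
# 2 0.67
# 4 0.8
# 10 0.91
```

With a single judge the formula returns r itself, and each additional judge adds less than the one before, which matters when judges are costly.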

As an aid to investigators employing these and related methods, Table 
3.10 has been prepared employing the Spearman-Brown formula. 

The table gives the effective reliability, R, for each of several values of n, 
the number of judges making the observations, and r, the mean reliability 
among the judges. It is intended to facilitate getting approximate answers to 
each of the following questions: 

(1) Given an obtained or estimated mean reliability, r, and a sample of n 
judges, what is the approximate effective reliability, R, of the mean of the 
judges’ ratings? The value of R is read from the table at the intersection of 
the appropriate row (n) and column (r). 

(2) Given the value of the obtained or desired effective reliability, R, and 
the number, n, of judges available, what will be the approximate value of the 
required mean reliability, r? The table is entered in the row corresponding to 
the n of judges available and is read across until the value of R closest to the 
one desired is reached. The value of r is then read as the corresponding 
column heading. 


TABLE 3.10 

Effective Reliability of the Mean of Judges’ Ratings 
NOTE: Decimal points omitted. 

(3) Given an obtained or estimated mean reliability, r, and the obtained or 
desired effective reliability, R, what is the approximate number (n) of judges 
required? The table is entered in the column corresponding to the mean relia¬ 
bility, r, and is read down until the value of R closest to the one desired is 
reached. The value of n is then read as the corresponding row title. 

Examples of each of the preceding questions may be useful: 

(1) Meta-analysts want to work with a quality variable believed to show a 
mean reliability of .5 and they can afford only 4 judges at the moment. They 
believe they should go ahead with their study only if the effective reliability 
will reach or exceed .75. Shall they go ahead? Answer: Yes, because Table 
3.10 shows R to be .80 for an n of 4 and an r of .5. 

(2) Meta-analysts who will settle for an effective reliability no less than 
.9 have a sample of 20 judges available. In their selection of quality varia¬ 
bles to be judged by these observers, what should be their minimally ac¬ 
ceptable mean reliability? Answer: .30. 

(3) Meta-analysts who know their choice of variables to have a mean 
reliability of .4 want to achieve an effective reliability of .85 or higher. How 
many judges must be allowed for in their preparation of a research budget? 
Answer: 9. 
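
The three worked examples can also be answered by solving the Spearman-Brown formula algebraically rather than reading a table. The sketch below is an illustration under that assumption; the function names are hypothetical. Solving R = nr / (1 + (n - 1)r) for r gives r = R / (n - (n - 1)R), and solving for n gives n = R(1 - r) / (r(1 - R)).

```python
import math


def effective_reliability(n, r):
    """Question 1: R from n judges with mean reliability r."""
    return n * r / (1 + (n - 1) * r)


def required_mean_reliability(n, R):
    """Question 2: mean reliability r needed for n judges to reach R."""
    return R / (n - (n - 1) * R)


def required_judges(r, R):
    """Question 3: smallest whole number of judges reaching at least R."""
    exact = R * (1 - r) / (r * (1 - R))
    # The small tolerance guards against rounding an exact integer upward.
    return math.ceil(exact - 1e-9)


print(round(effective_reliability(4, 0.5), 2))       # 0.8
print(round(required_mean_reliability(20, 0.9), 2))  # 0.31
print(required_judges(0.4, 0.85))                    # 9
```

The exact solution to the second example is about .31; the answer of .30 in the text is the nearest tabled value of r, for which 20 judges give an effective reliability of about .90.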

II.C.1.a. Product moment correlations. It should be noted that the mean 
reliability (r) of Table 3.10 is to be a product moment correlation coefficient 
such as Pearson’s r or its special cases, the Spearman rank correlation (rho), 
the point biserial r, or the phi coefficient. It is not appropriate to employ 
such indices of reliability as percentage or proportion agreement; e.g., num¬ 
ber of agreements (A) divided by the sum of agreements (A) and disagree¬ 
ments (D), A/(A + D) or net agreements, (A — D)/(A + D). These indices 
should not only be avoided in any use of Table 3.10, but they should be 
avoided in general because of the greatly misleading results that they can 
yield. For example, suppose two judges are to evaluate 100 field studies for 
the presence or absence of external validity. If both the judges see external 
validity in 98 of the field studies and disagree only twice, they would show 
98% agreement; yet the χ² testing the significance of the product moment
correlation phi would be essentially zero! Thus two judges who shared the 
same bias (e.g., almost all field studies are externally valid) could consist¬ 
ently earn nearly perfect agreement scores while actually correlating essen¬ 
tially zero with one another (phi = -.01). 
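
A short sketch makes the contrast between percentage agreement and phi concrete. It assumes, hypothetically, that the two disagreements fall one in each direction (the text does not say how they split); under that assumption the 2 x 2 table has cells 98, 1, 1, and 0.

```python
import math

def agreement_and_phi(both_yes, a_only, b_only, both_no):
    """Proportion agreement and phi for two judges making yes/no judgments."""
    n = both_yes + a_only + b_only + both_no
    agreement = (both_yes + both_no) / n
    # Marginal totals for judge A and judge B
    a_yes, a_no = both_yes + a_only, b_only + both_no
    b_yes, b_no = both_yes + b_only, a_only + both_no
    phi = (both_yes * both_no - a_only * b_only) / math.sqrt(a_yes * a_no * b_yes * b_no)
    return agreement, phi

# Assumed split of the two disagreements: one in each direction
agreement, phi = agreement_and_phi(both_yes=98, a_only=1, b_only=1, both_no=0)
print(round(agreement, 2), round(phi, 2))
```
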

II.C.2. Reliability and analysis of variance. When there are only two
judges whose reliability is to be evaluated it is hard to beat the convenience
of a product moment correlation coefficient as an appropriate index of reliability.
As the number of judges grows larger, however, working with correlation
coefficients can become inconvenient. For example, suppose we employed
40 judges and wanted to compute both their mean reliability (r) and
their effective reliability (R). Table 3.10 could get us R from knowing r but
to get r we would have to compute (40 × 39)/2 = 780 correlation coefficients.
That is not hard work for computers, but averaging the 780 coefficients
to get r is very hard work for investigators or their programmers.
There is an easier way and it involves the analysis of variance.

Table 3.11 shows a simple example of three judges rating the quality of 
five studies on a scale of 1 to 7, and Table 3.12 shows the analysis of vari¬ 
ance of these data. Our computations require only the use of the last 
column, the column of mean squares (Guilford, 1954). Examination of com¬ 
putational formulas 3.2 and 3.3 given below shows that they tell how well 
the judges can discriminate among the sampling units (e.g., studies) minus 
the judges’ disagreements controlling for judges’ rating bias or main effects 
(e.g., MS encoders - MS residuals), divided by a standardizing quantity. 

Our estimate of R, the effective reliability of the ratings of all the judges 
is given by 


\[ R(\text{est.}) = \frac{MS_{\text{studies}} - MS_{\text{residual}}}{MS_{\text{studies}}} \qquad [3.2] \]


Our estimate of r, the mean reliability or the reliability of a single average 
judge is given by 


\[ r(\text{est.}) = \frac{MS_{\text{studies}} - MS_{\text{residual}}}{MS_{\text{studies}} + (n - 1)\,MS_{\text{residual}}} \qquad [3.3] \]


TABLE 3.11

Judges’ Ratings of Research Quality

               Judges
Studies     A     B     C     Σ
   1        5     6     7    18
   2        3     6     4    13
   3        3     4     6    13
   4        2     2     3     7
   5        1     4     4     9
   Σ       14    22    24    60


TABLE 3.12

Analysis of Variance of Judges’ Ratings

Source       SS     df     MS
Studies     24.0     4    6.00
Judges      11.2     2    5.60
Residual     6.8     8    0.85

In equation 3.3, n is the number of judges, as before; equation 3.3 is known as
the intraclass correlation. For our example of Tables 3.11 and 3.12 we have

\[ R(\text{est.}) = \frac{6.00 - 0.85}{6.00} = .858 \]

\[ r(\text{est.}) = \frac{6.00 - 0.85}{6.00 + (3 - 1)0.85} = .669 \]
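
The mean squares of Table 3.12 and the two reliability estimates can be reproduced from the ratings of Table 3.11 with a few lines of code. This is a minimal sketch of the two-way (studies by judges) analysis of variance described above; the function and variable names are our own.

```python
# Rows are studies 1-5; columns are judges A, B, C (Table 3.11)
ratings = [
    [5, 6, 7],
    [3, 6, 4],
    [3, 4, 6],
    [2, 2, 3],
    [1, 4, 4],
]

def anova_reliability(rows):
    """Return (MS studies, MS residual, R est., r est.) per equations 3.2 and 3.3."""
    k, n = len(rows), len(rows[0])          # k studies, n judges
    grand_total = sum(map(sum, rows))
    correction = grand_total ** 2 / (k * n)
    ss_total = sum(x * x for row in rows for x in row) - correction
    ss_studies = sum(sum(row) ** 2 for row in rows) / n - correction
    ss_judges = sum(sum(col) ** 2 for col in zip(*rows)) / k - correction
    ss_residual = ss_total - ss_studies - ss_judges
    ms_studies = ss_studies / (k - 1)
    ms_residual = ss_residual / ((k - 1) * (n - 1))
    big_r = (ms_studies - ms_residual) / ms_studies                          # equation 3.2
    small_r = (ms_studies - ms_residual) / (ms_studies + (n - 1) * ms_residual)  # equation 3.3
    return ms_studies, ms_residual, big_r, small_r

ms_studies, ms_residual, R_est, r_est = anova_reliability(ratings)
print(round(ms_studies, 2), round(ms_residual, 2), round(R_est, 3), round(r_est, 3))
```
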


In the present example it will be easy to compare the results of the analysis
of variance approach with the more cumbersome correlational approach.
Thus the correlations (r) between pairs of judges ($r_{AB}$, $r_{BC}$, and $r_{AC}$)
are .645, .582, and .800 respectively, and the mean intercorrelation is .676
which differs by only .007 from the estimate (.669) obtained by means of
the analysis of variance approach.

If we were employing only the correlational approach we would apply 
the Spearman-Brown formula (equation 3.1) to our mean reliability of .676 
to find R, the effective reliability. The result is 

\[ R = \frac{(3)(.676)}{1 + (3 - 1)(.676)} = .862 \]

which differs by only .004 from the estimate (.858) obtained by means of 
the analysis of variance approach. In general, the differences obtained be¬ 
tween the correlational approach and the analysis of variance approach are 
quite small (Guilford, 1954). 
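
For comparison, the correlational route can be sketched in the same way: compute every pairwise correlation among the judges of Table 3.11, average them, and apply the Spearman-Brown formula (equation 3.1). The helper name `pearson` is ours; nothing here goes beyond the arithmetic already described.

```python
import math
from itertools import combinations

# Ratings of the five studies by each judge (columns of Table 3.11)
judges = {
    "A": [5, 3, 3, 2, 1],
    "B": [6, 6, 4, 2, 4],
    "C": [7, 4, 6, 3, 4],
}

def pearson(x, y):
    """Product moment correlation between two equal-length lists."""
    mx, my = sum(x) / len(x), sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

# One correlation per pair of judges: AB, AC, BC
pairs = {a + b: pearson(judges[a], judges[b]) for a, b in combinations(judges, 2)}
mean_r = sum(pairs.values()) / len(pairs)
n = len(judges)
R_corr = n * mean_r / (1 + (n - 1) * mean_r)   # Spearman-Brown, equation 3.1
print({k: round(v, 3) for k, v in pairs.items()}, round(mean_r, 3), round(R_corr, 3))
```

With 40 judges the same loop would run over 780 pairs, which is the inconvenience the analysis of variance approach avoids.
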

It should be noted that in our present simple example, the correlational 
approach was not an onerous one to employ, with only three correlations to 
compute. As the number of judges increased, however, we would find our¬ 
selves more and more grateful for the analysis of variance approach. 

II.C.2.a. Quality of research and effect size. As an additional example of
the computation of reliability from analysis of variance, we examine some 
data summarized by Glass et al. (1981). For 11 different meta-analyses, sep¬ 
arate estimates of mean effect size were given for studies judged to be of 
high, medium, or low internal validity. On the basis of an analysis weighting 
the 11 studies equally, the mean effect sizes (Δ) were found to be .42, .34,
and .57 respectively, for the high, medium, and low quality studies. The 
analogous medians were .48, .31, and .59. These results suggest no very 
great linear effect of quality on mean effect size obtained. (Note that though 
“poor” studies tend to show larger effects, “good” studies tend to show 
larger effects than intermediate studies.)

The question we want to put to these data is: What is the degree of agreement,
in the sense of reliability coefficients, among the three levels of quality?
We address the question via analysis of variance and find:

\[ R(\text{est.}) = \frac{MS_{\text{studies}} - MS_{\text{residual}}}{MS_{\text{studies}}} = \frac{.1821 - .0571}{.1821} = .686 \]


Therefore, the effective reliability of the differentiation of the 11 meta¬ 
analyses is seen to be about .69. Our estimate of the mean reliability (r), or 
the reliability of a single level of quality is: 


\[ r(\text{est.}) = \frac{MS_{\text{studies}} - MS_{\text{residual}}}{MS_{\text{studies}} + (n - 1)\,MS_{\text{residual}}} = \frac{.1821 - .0571}{.1821 + (3 - 1).0571} = .422 \]


In this example, the correlation of the high quality with the medium 
quality studies was .558, while the correlations of high with low and me¬ 
dium with low quality studies were .381 and .376, respectively. The mean of
these three reliabilities was .438, a value quite close to that obtained from 
the analysis of variance (.422). All in all, the low quality studies do not 
agree as well with the others (the high and the medium) in differentiating 
the meta-analyses. 

II.C.3. Reliability and principal components. In situations where the ratings
made by all judges have been intercorrelated, and a principal components
analysis is readily available, another very efficient alternative to estimate
the reliability of the total set of judges is available. Armor (1974) has
developed an index, theta (θ), that is based on the unrotated first principal
component (where a principal component is a factor extracted from a correlation
matrix employing unity [1.00] in the diagonal of the correlation matrix).
The formula for theta is

\[ \theta = \frac{n}{n - 1}\left(\frac{L - 1}{L}\right) \qquad [3.4] \]

where n is the number of judges and L is the latent root or eigenvalue of the
first unrotated principal component. The latent root is the sum of the
squared factor loadings for any given factor and can be thought of as the
amount of variance in the judges’ ratings accounted for by that factor. Factor
analytic computer programs generally give latent roots or eigenvalues
for each factor extracted so that θ is very easy to obtain in practice.
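
As an illustration only, theta can be obtained for the three judges of the earlier example from their intercorrelations (.645, .582, and .800). The sketch below finds the first latent root by simple power iteration, which is our own choice of method for keeping the example self-contained; any principal components program would supply L directly.

```python
import math

def largest_eigenvalue(matrix, iterations=200):
    """Largest latent root of a correlation matrix, by power iteration."""
    n = len(matrix)
    v = [1.0] * n
    value = 0.0
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        value = math.sqrt(sum(x * x for x in w))  # converges to the largest root
        v = [x / value for x in w]
    return value

def armor_theta(matrix):
    """Armor's theta (equation 3.4) from a correlation matrix with unities in the diagonal."""
    n = len(matrix)
    L = largest_eigenvalue(matrix)
    return (n / (n - 1)) * ((L - 1) / L)

# Intercorrelations among judges A, B, C from the example of Table 3.11
corr = [
    [1.000, 0.645, 0.800],
    [0.645, 1.000, 0.582],
    [0.800, 0.582, 1.000],
]
L = largest_eigenvalue(corr)
theta = armor_theta(corr)
print(round(L, 3), round(theta, 3))
```

The resulting theta falls close to the effective reliabilities found earlier by the analysis of variance and correlational approaches.
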

II.C.4. Reporting reliabilities. Assuming we have done our reliability 
analyses well, how shall we report our results? Ideally, reports of reliability 
analyses should include both the mean reliability (the reliability of a single
judge) and the effective reliability (reliability of the total set of judges or of
the mean judgments). The reader needs to know the latter reliability (R) 
because that is, in fact, the reliability of the variable employed in most 
cases. However, if this reliability is reported without explanation, the reader 
may not be aware that the reliability of any one judge’s ratings is likely to 
be lower, often substantially so. A reader may note a reported reliability of 
.80 based on 12 judges and decide that the variable is sufficiently reliable 
for his or her purposes. This reader may then employ a single judge only to 
find later that this single judge was operating at a reliability of .25, not .80.
Reporting both reliabilities avoids such misunderstandings. 

II.C.4.a. Split-sample reliabilities. A related source of misunderstanding is
the reporting of correlations between a mean judge of one type with a mean
judge of another type. For example, suppose we had 10 male and 10 female
judges, or 10 student and 10 faculty judges. One sometimes sees in the literature
the reliability of the mean male and mean female judge or of the mean
student and mean faculty judge. Such a correlation of the mean ratings made
by all judges of one type with the mean ratings made by judges of another type
can be very useful, but it should not be reported as a reliability without the
explanation that this correlation might be substantially higher than the average
correlation between any one male and any one female judge or between
any one student and any one faculty judge. The reasons for this are those
discussed in the earlier section on effective reliability.

II.C.4.b. Trimming judges. It sometimes happens that when we examine
the intercorrelations among our judges we find one that is very much out of 
line with all the others. Perhaps this judge tends to obtain negative correla¬ 
tions with other judges or at least to show clearly lower reliabilities with other 
judges than is typical for the correlation matrix. If this unreliable judge were 
dropped from the data, the resulting estimates of reliability would be biased, 
i.e., made to appear too reliable. If a judge must be dropped, the resulting bias 
can be reduced by equitable trimming. Thus if the lowest agreeing judge is 
dropped, the highest agreeing judge is also dropped. If the two lowest agree¬ 
ing judges are dropped, the two highest agreeing judges are also dropped and 
so on. Experience suggests that when large samples of judges are employed 
the effects of trimming judges are small as is the need for trimming. When the 
sample of judges is small, we may feel a stronger need to drop a judge, but 
doing so is more likely to leave a residual biased estimate of reliability. A safe 
procedure is to do all analyses with and without the trimming of judges and to 
report the differences in results from data with and without the trimming. 
Although the method of trimming judges seems not yet to have been system¬ 
atically applied, the theoretical foundations for the method can be seen in the 
writings of Mosteller and Rourke (1973), Tukey (1977), and Hoaglin, Mosteller,
and Tukey (1983).

Comparing and Combining
Research Results


A framework for meta-analytic procedures is described in which the comparing function 
and the combining function of meta-analytic procedures are distinguished. Procedures 
are provided for comparing and for combining the tests of significance and the effect size 
estimates from two or more studies. 


I. A FRAMEWORK FOR META-ANALYTIC PROCEDURES

In this chapter we consider in detail the application of various meta- 
analytic procedures. Before we wax computational, however, it will be use¬ 
ful to consider a general framework for putting into perspective a variety of 
meta-analytic procedures. Table 4.1 provides a summary of four types of 
meta-analytic procedures that are applicable to the special case where just 
two studies are to be evaluated. It is useful to list the two-study case sepa¬ 
rately because there are some especially convenient computational proce¬ 
dures for this situation. The two columns of Table 4.1 show that there are 
two major ways to evaluate the results of research studies — in terms of their 
statistical significance (e.g., p levels) and in terms of their effect sizes (e.g., 
the difference between means divided by the common standard deviation σ
or S, indices employed by Cohen [1969, 1977, 1988] and by Glass [1980] 
and Hedges [1981], or the Pearson r). The two rows of Table 4.1 show that 
there are two major analytic processes applied to the set of studies to be 
evaluated: comparing and combining. The cell labeled A in Table 4.1 repre¬ 
sents the procedure that evaluates whether the significance level of one study 
differs significantly from the significance level of the other study. The cell 
labeled B represents the procedure that evaluates whether the effect size (e.g., 
d or r) of one study differs significantly from the effect size of the other study.
Cells C and D represent the procedures that are used to estimate the overall
level of significance and the average size of the effect, respectively. Illustra¬ 
tions of these procedures will be given below. 

TABLE 4.1

Four Types of Meta-Analytic Procedures
Applicable to a Set of Two Studies

                           Results Defined in Terms of:
Analytic Process         Significance testing   Effect size estimation
Comparing studies                 A                       B
Combining studies                 C                       D



TABLE 4.2

Six Types of Meta-Analytic Procedures
Applicable to a Set of Three or More Studies

                                        Results Defined in Terms of:
Analytic Process                      Significance testing   Effect size estimation
Comparing studies: Diffuse tests               A                       B
Comparing studies: Focused tests               C                       D
Combining studies                              E                       F



Table 4.2 provides a more general summary of six types of meta-analytic 
procedures that are applicable to the case where three or more studies are to 
be evaluated. The columns are as in Table 4.1 but the row labeled “Compar¬ 
ing Studies” in Table 4.1 has now been divided into two rows — one for the 
case of diffuse tests and one for the case of focused tests. 

When studies are compared as to their significance levels (Cell A) or 
their effect sizes (Cell B) by diffuse tests, we learn whether they differ sig¬ 
nificantly among themselves with respect to significance levels or effect 
sizes, respectively, but we do not learn how they differ or whether they 
differ according to any systematic basis. When studies are compared as to 
their significance levels (Cell C) or their effect sizes (Cell D) by focused
tests, or contrasts, we learn whether the studies differ significantly among
themselves in a theoretically predictable or meaningful way. Thus important
tests of hypotheses can be made by the use of focused tests. Cells E and
F of Table 4.2 are simply analogues of Cells C and D of Table 4.1, representing
procedures used to estimate overall level of significance and average
size of the effect, respectively. 

II. META-ANALYTIC PROCEDURES: 

TWO INDEPENDENT STUDIES 

Even when we have been quite rigorous and sophisticated in the interpre¬ 
tation of the results of a single study, we are often prone to err in the inter¬ 
pretation of two or more studies. For example, Smith may report a signifi¬ 
cant effect of some social intervention only to have Jones publish a rebuttal 
demonstrating that Smith was wrong in her claim. A closer look at both 
their results may show the following: 

Smith’s Study: t(78) = 2.21, p < .05, d = .50, r = .24.

Jones’s Study: t(18) = 1.06, p > .30, d = .50, r = .24.

Smith’s results were more significant than Jones’s, to be sure, but the stud¬ 
ies were in perfect agreement as to their estimated sizes of effect defined by 
either d or r. A further comparison of their respective significance levels re¬ 
veals that these p’s are not significantly different (p = .42). Clearly Jones was 
quite wrong in claiming that he had failed to replicate Smith’s results. We 
shall begin this section by considering some procedures for comparing quan¬ 
titatively the results of two independent studies, i.e., studies conducted with 
different research participants. The examples in this chapter are in most cases 
hypothetical, constructed specifically to illustrate a wide range of situations 
that occur when working on meta-analytic problems. 

II. A. Comparing Studies 

II.A.1. Significance testing. Ordinarily when we compare the results of
two studies we are more interested in comparing their effect sizes than their 
p values. However, sometimes we cannot do any better than comparing their 
p values and here is how we do it (Rosenthal & Rubin, 1979a): For each of 
the two test statistics we obtain a reasonably exact one-tailed p level. All of 
the procedures described in this chapter require that p levels be recorded as 
one-tailed. Thus t(100) = 1.98 is recorded as p = .025, not p = .05. Then as 
an illustration of being reasonably exact, if we obtain t(30) = 3.03 we give p 
as .0025, not as “< .05.” Extended tables of the t distribution are helpful 
here (e.g., Federighi, 1959; Rosenthal & Rosnow, 1984a; 1991); as are 
inexpensive calculators with built-in distributions of Z, t, F, and χ². For each
p, we find Z, the standard normal deviate corresponding to the p value. Since
both p’s must be one-tailed, the corresponding Z’s will have the same sign if
both studies show effects in the same direction but different signs if the results
are in the opposite direction. The difference between the two Z’s when
divided by √2 yields a new Z that corresponds to the p value that the
difference between the Z’s could be so large, or larger, if the two Z’s did not
really differ. Recapping,

\[ \frac{Z_1 - Z_2}{\sqrt{2}} \ \text{is distributed as } Z \qquad [4.1] \]


Example 1. Studies A and B yield results in opposite directions and nei¬ 
ther is “significant.” One p is .06, one-tailed, the other is .12, one-tailed but 
in the opposite tail. The Z’s corresponding to these p’s are found in a table 
of the normal curve to be +1.56 and -1.18. (Note the opposite signs to 
indicate results in opposite directions.) Then, from the preceding equation 
(4.1) we have 




\[ \frac{Z_1 - Z_2}{\sqrt{2}} = \frac{(1.56) - (-1.18)}{1.41} = 1.94 \]


as the Z of the difference between the two p values or their corresponding 
Z’s. The p value associated with a Z of 1.94 is .026 one-tailed or .052 two- 
tailed. The two p values may be seen to differ significantly, suggesting that 
we may want to draw different inferences from the results of the two studies. 
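
Where a table of the normal curve is not at hand, the procedure of equation 4.1 can be sketched with Python's standard `statistics.NormalDist`. The function names and the `direction` argument are our own illustrative conventions. Because the sketch carries the Z's unrounded, Example 1 comes out at about 1.93 rather than the 1.94 obtained from two-decimal table values.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # mean 0, standard deviation 1

def z_from_p(p_one_tailed, direction=+1):
    """Standard normal deviate for a one-tailed p; direction is +1 or -1."""
    return direction * _STD_NORMAL.inv_cdf(1 - p_one_tailed)

def compare_z(z1, z2):
    """Equation 4.1: Z of the difference and its one-tailed p."""
    z = (z1 - z2) / math.sqrt(2)
    p = 1 - _STD_NORMAL.cdf(abs(z))
    return z, p

# Example 1: p = .06 in one direction, p = .12 in the opposite direction
z_a = z_from_p(0.06, +1)
z_b = z_from_p(0.12, -1)
z_diff, p_diff = compare_z(z_a, z_b)
print(round(z_a, 2), round(z_b, 2), round(z_diff, 2), round(p_diff, 3))
```
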

Example 2. Studies A and B yield results in the same direction and both 
are significant. One p is .04, the other is .000025. The Z’s corresponding to 
these p’s are 1.75 and 4.06. (Since both Z’s are in the same tail they have the 
same sign.) From equation 4.1 we have 

\[ \frac{Z_1 - Z_2}{\sqrt{2}} = \frac{(4.06) - (1.75)}{1.41} = 1.64 \]


as our obtained Z of the difference. The p associated with that Z is .05 
one-tailed or .10 two-tailed, so we may want to conclude that the two p values 
differ significantly or nearly so. It should be emphasized, however, that 
finding one Z greater than another does not tell us whether that Z was 
greater because the size of the effect was greater, the size of the study (e.g., 
N) was greater, or both. 

Example 3. Studies A and B yield results in the same direction, but one is
“significant” (p = .05) and the other is not (p = .06). This illustrates the
worst case scenario for inferential errors where investigators might conclude
that the two results are inconsistent because one is significant and the
other is not. Regrettably, this example is not merely theoretical. Just such 
errors have been made and documented (Rosenthal & Gaito, 1963, 1964). 
The Z’s corresponding to these p’s are 1.64 and 1.55. From equation 4.1 we 
have 

\[ \frac{Z_1 - Z_2}{\sqrt{2}} = \frac{(1.64) - (1.55)}{1.41} = .06 \]

as our obtained Z of the difference between a p value of .05 and .06. The p 
value associated with this difference is .476 one-tailed or .952 two-tailed. 
This example shows clearly just how nonsignificant the difference between 
significant and nonsignificant results can be. 

II.A.2. Effect size estimation. When we ask whether two studies are tell¬ 
ing the same story, what we usually mean is whether the results (in terms of 
the estimated effect size) are reasonably consistent with each other or 
whether they are significantly heterogeneous. The present chapter will em¬ 
phasize r as the effect size indicator but analogous procedures are available 
for comparing such other effect size indicators as Hedges’s (1981) g or differences
between proportions, d' (Hedges, 1982b; Hsu, 1980; Rosenthal &
Rubin, 1982a). These will be described and illustrated shortly. 

For each of the two studies to be compared we compute the effect size r
and find for each of these r’s the associated Fisher $z_r$ defined as
$\tfrac{1}{2}\log_e[(1 + r)/(1 - r)]$. Tests of the significance of differences between r’s are more
accurate when this transformation is employed (Alexander, Scozzaro, &
Borodkin, 1989). In addition, equal differences between any pair of Fisher
$z_r$’s are equally detectable, a situation that does not hold for untransformed
r’s. Tables to convert our obtained r’s to Fisher $z_r$’s are available in most
introductory textbooks of statistics. Then, when $N_1$ and $N_2$ represent the
number of sampling units (e.g., subjects) in each of our two studies, the
quantity

\[ \frac{z_{r_1} - z_{r_2}}{\sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}}} \qquad [4.2] \]

is distributed as Z (Snedecor & Cochran, 1967, 1980, 1989).

Example 4. Studies A and B yield results in opposite directions with
effect sizes of r = .60 (N = 15) and r = -.20 (N = 100), respectively. The
Fisher $z_r$’s corresponding to these r’s are .69 and -.20, respectively. (Note
the opposite signs of the $z_r$’s to correspond to the opposite signs of the r’s.)
Then from the preceding equation (4.2) we have

\[ \frac{z_{r_1} - z_{r_2}}{\sqrt{\frac{1}{N_1 - 3} + \frac{1}{N_2 - 3}}} = \frac{(.69) - (-.20)}{\sqrt{\frac{1}{12} + \frac{1}{97}}} = 2.91 \]

as the Z of the difference between the two effect sizes. The p value associ¬ 
ated with a Z of 2.91 is .002 one-tailed or .004 two-tailed. These two effect 
sizes, then, differ significantly. 

Example 5. Studies A and B yield results in the same direction with effect
sizes of r = .70 (N = 20) and r = .25 (N = 95), respectively. The Fisher $z_r$’s
corresponding to these r’s are .87 and .26, respectively. From equation 4.2
we have

\[ \frac{(.87) - (.26)}{\sqrt{\frac{1}{17} + \frac{1}{92}}} = 2.31 \]

as our obtained Z of the difference. The p associated with that Z is .01 one- 
tailed or .02 two-tailed. Here is an example of two studies that agree on a 
significant positive relationship between variables X and Y but disagree 
significantly in their estimates of the size of the relationship. 
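
The comparison of two r's by equation 4.2 can be sketched as follows. The Fisher transformation is the inverse hyperbolic tangent, available as `math.atanh`; the function name `compare_r` is ours. Since the sketch does not round the Fisher z's to two decimals, its Z's differ slightly from the hand calculations of Examples 4 and 5.

```python
import math
from statistics import NormalDist

def compare_r(r1, n1, r2, n2):
    """Equation 4.2: Z of the difference between two independent r's, and its one-tailed p."""
    z_r1, z_r2 = math.atanh(r1), math.atanh(r2)   # Fisher z transformation
    z = (z_r1 - z_r2) / math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))
    p = 1 - NormalDist().cdf(abs(z))
    return z, p

z4, p4 = compare_r(0.60, 15, -0.20, 100)   # Example 4
z5, p5 = compare_r(0.70, 20, 0.25, 95)     # Example 5
print(round(z4, 2), round(p4, 3), round(z5, 2), round(p5, 3))
```
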

Example 6. Studies A and B yield effect size estimates of r = .00 (N =
17) and r = .30 (N = 45), respectively. The Fisher $z_r$’s corresponding to
these r’s are .00 and .31, respectively. From equation 4.2 we have

\[ \frac{(.00) - (.31)}{\sqrt{\frac{1}{14} + \frac{1}{42}}} = -1.00 \]

as our obtained Z of the difference between our two effect size estimates.
The p associated with that Z is .16 one-tailed or .32 two-tailed. Here we
have an example of two effect sizes, one zero (r = .00), the other (r = .30)
significantly different from zero (t(43) = 2.06, p < .025 one-tailed), but
which do not differ significantly from one another. This illustrates how 
careful we must be in concluding that results of two studies are heteroge¬ 
neous just because one is significant and the other is not or because one has a 
zero estimated effect size and the other does not (Rosenthal & Rosnow, 
1984a, 1991).

II.A.2.a. Other effect size estimates. Although r is our preferred effect
size estimate in this chapter, analogous procedures are available for such
other effect size estimates as $(M_1 - M_2)/S$ (Hedges’s g) or the difference
between proportions, d'. We begin with the case of Hedges’s g.

For each of the two studies to be compared, we compute the effect size
$(M_1 - M_2)/S$ (Hedges’s g) and the quantity 1/w, which is the estimated
variance of g. We obtain w as follows (Rosenthal & Rubin, 1982a):

\[ w = \frac{2(n_1 n_2)(n_1 + n_2 - 2)}{(n_1 + n_2)[t^2 + 2(n_1 + n_2 - 2)]} \qquad [4.3] \]

When we have w we can test the significance of the difference between
any two independent g’s by means of a Z test since

\[ \frac{g_A - g_B}{\sqrt{\frac{1}{w_A} + \frac{1}{w_B}}} \qquad [4.4] \]

is distributed as Z, as shown in somewhat different form in Rosenthal and 
Rubin (1982a). Note the similarity in structure between equations 4.4 and 
4.2. In both cases the differences in effect size are divided by the square 
root of the sums of the variances of the individual effect sizes. 

Example 7. Studies A and B yield results in the same direction with effect 
sizes of g = 1.86 (t = 4.16; N = 20) and g = .51 (t = 2.49; N = 95), 
respectively. Assuming that the two conditions being compared within each 
study are comprised of sample sizes of 10 and 10 in Study A and 47 and 48 in 
Study B, we first find w for each study.

\[ w_A = \frac{2(n_1 n_2)(n_1 + n_2 - 2)}{(n_1 + n_2)[t^2 + 2(n_1 + n_2 - 2)]} = \frac{2(10)(10)(10 + 10 - 2)}{(10 + 10)[(4.16)^2 + 2(10 + 10 - 2)]} = 3.38 \]

\[ w_B = \frac{2(n_1 n_2)(n_1 + n_2 - 2)}{(n_1 + n_2)[t^2 + 2(n_1 + n_2 - 2)]} = \frac{2(47)(48)(47 + 48 - 2)}{(47 + 48)[(2.49)^2 + 2(47 + 48 - 2)]} = 22.98 \]

Therefore, from equation 4.4:

\[ \frac{g_A - g_B}{\sqrt{\frac{1}{w_A} + \frac{1}{w_B}}} = \frac{1.86 - .51}{\sqrt{\frac{1}{3.38} + \frac{1}{22.98}}} = 2.32 \]


as our obtained Z of the difference. The p associated with that Z is .01 one- 
tailed or .02 two-tailed. Here is an example of two studies that agree there is
a significant effect of the independent variable, but disagree significantly in
their estimates of the size of the effect.
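
The arithmetic of Example 7 can be sketched in a few lines. The function names are ours; `compare_effects` simply divides the difference between two effect sizes by the square root of the sum of their estimated variances (1/w), as in equation 4.4.

```python
import math

def w_for_g(n1, n2, t):
    """Equation 4.3: reciprocal of the estimated variance of Hedges's g."""
    df = n1 + n2 - 2
    return 2 * n1 * n2 * df / ((n1 + n2) * (t ** 2 + 2 * df))

def compare_effects(effect_a, w_a, effect_b, w_b):
    """Equation 4.4: Z of the difference between two independent effect sizes."""
    return (effect_a - effect_b) / math.sqrt(1 / w_a + 1 / w_b)

# Example 7: g = 1.86 (t = 4.16; n's of 10 and 10) and g = .51 (t = 2.49; n's of 47 and 48)
w_a = w_for_g(10, 10, 4.16)
w_b = w_for_g(47, 48, 2.49)
z_g = compare_effects(1.86, w_a, 0.51, w_b)
print(round(w_a, 2), round(w_b, 2), round(z_g, 2))
```
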

Suppose that in the present example we had found Studies A and B but 
that no effect sizes had been computed — only t tests. If our preference were 
to work with r as our effect size estimate we could get r from equation 2.16.
Recall that t’s and N’s for these studies were 4.16 (N = 20) and 2.49 (N =
95), respectively; then we can get the two r’s:

\[ r_A = \sqrt{\frac{t^2}{t^2 + df}} = \sqrt{\frac{(4.16)^2}{(4.16)^2 + 18}} = .70 \]

\[ r_B = \sqrt{\frac{t^2}{t^2 + df}} = \sqrt{\frac{(2.49)^2}{(2.49)^2 + 93}} = .25 \]


We could compare these r’s easily; in fact we did so in example 5. The Z
we obtained there was 2.31, very close to the Z we obtained when comparing
g’s (Z = 2.32).

Now suppose we had remembered how to get r from t but had forgotten 
how to compare two r’s. If we recalled how to compare two g’s we could 
convert our r’s to g’s by means of equation 2.27:

\[ g = \frac{r}{\sqrt{1 - r^2}} \times \sqrt{\frac{df(n_1 + n_2)}{n_1 n_2}} \qquad [2.27] \]


For the present example: 


\[ g_A = \frac{.70}{\sqrt{1 - (.70)^2}} \times \sqrt{\frac{18(10 + 10)}{(10)(10)}} = 1.86 \]

\[ g_B = \frac{.25}{\sqrt{1 - (.25)^2}} \times \sqrt{\frac{93(47 + 48)}{(47)(48)}} = .51 \]


Of course, we could also have computed g directly from t by means of 
equations 2.25 (or 2.26, or 2.5). From equation 2.25 we have: 


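\[ g_A = t\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = 4.16\sqrt{\frac{1}{10} + \frac{1}{10}} = 1.86 \]

\[ g_B = t\sqrt{\frac{1}{n_1} + \frac{1}{n_2}} = 2.49\sqrt{\frac{1}{47} + \frac{1}{48}} = .51 \]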
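
The conversions just described (equations 2.16, 2.27, and 2.25) can be sketched together, which also shows that the two routes to g agree. The function names are ours and are meant only as illustration.

```python
import math

def r_from_t(t, df):
    """Equation 2.16: r from t, carrying the sign of t."""
    return math.copysign(math.sqrt(t * t / (t * t + df)), t)

def g_from_r(r, n1, n2):
    """Equation 2.27: Hedges's g from r and the two sample sizes."""
    df = n1 + n2 - 2
    return r / math.sqrt(1 - r * r) * math.sqrt(df * (n1 + n2) / (n1 * n2))

def g_from_t(t, n1, n2):
    """Equation 2.25: Hedges's g directly from t."""
    return t * math.sqrt(1 / n1 + 1 / n2)

r_a = r_from_t(4.16, 18)   # Study A
r_b = r_from_t(2.49, 93)   # Study B
g_a_via_r = g_from_r(r_a, 10, 10)
g_b_via_r = g_from_r(r_b, 47, 48)
g_a_via_t = g_from_t(4.16, 10, 10)
g_b_via_t = g_from_t(2.49, 47, 48)
print(round(r_a, 2), round(r_b, 2), round(g_a_via_t, 2), round(g_b_via_t, 2))
```
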


Finally, if we should have Cohen’s d available [$(M_1 - M_2)/\sigma$] and
wanted to get g we could do so as follows:

\[ g = \frac{d}{\sqrt{\frac{n_1 + n_2}{n_1 + n_2 - 2}}} \qquad [4.5] \]


If our effect size estimate were the difference between proportions (d'),
our procedure would be analogous to that when our effect size estimate was
Hedges’s g. Again we need the estimated variance of the effect size estimate,
1/w. In this application we estimate w by equation 4.6, which works
well unless $n_1$ or $n_2$ is very small and $p_1$ or $p_2$ is very close to zero or one. If
$n_1$ or $n_2$ is very small, a conservative procedure is to replace p(1 - p) by
its maximal possible value of .25 (i.e., when p = (1 - p) = .50 we find
p(1 - p) to be at a maximum and equal to .25).

\[ w = \frac{n_1 n_2}{n_2 p_1(1 - p_1) + n_1 p_2(1 - p_2)} \qquad [4.6] \]


In meta-analytic work, however, we are sometimes unable to obtain the
values of $n_1$ and $n_2$. Accordingly we employ an approximation to w that
depends only on the total study size N and the effect size estimate d' (Rosenthal
& Rubin, 1982a):

\[ w = \frac{N}{1 - d'^2} \qquad [4.7] \]


This approximation to equation 4.6 holds exactly when $p_1$ and $p_2$ are the
same amount above and below .5 and when $n_1 = n_2$.

When we have w we can test the significance of the difference between
any two independent d'’s by means of a Z test since

\[ \frac{d'_A - d'_B}{\sqrt{\frac{1}{w_A} + \frac{1}{w_B}}} \ \text{is distributed as } Z \qquad [4.8] \]


as shown in somewhat different form in Rosenthal and Rubin (1982a). Just 
as was the case when effect size estimates were r and g (equations 4.2 and 
4.4), the differences in effect size are divided by the square root of the sums 
of the variances of the individual effect sizes. 

Example 8. Studies A and B yield results in the same direction with effect
sizes of d' = .70 (N = 20) and d' = .25 (N = 95), respectively. Assuming
that the two conditions being compared within each study are comprised of
sample sizes of 10 and 10 in Study A and 47 and 48 in Study B, we find w
first from equation 4.6. Then, as a further illustration, we also employ the
approximation equation 4.7:

\[ w_{A_1} = \frac{n_1 n_2}{n_2 p_1(1 - p_1) + n_1 p_2(1 - p_2)} = \frac{(10)(10)}{(10).85(.15) + (10).15(.85)} = 39.22 \]

\[ w_{B_1} = \frac{n_1 n_2}{n_2 p_1(1 - p_1) + n_1 p_2(1 - p_2)} = \frac{(47)(48)}{(48).375(.625) + (47).625(.375)} = 101.32 \]

\[ w_{A_2} = \frac{N}{1 - d'^2} = \frac{20}{1 - (.70)^2} = 39.22, \]

agreeing perfectly with the result above ($w_{A_1}$).

\[ w_{B_2} = \frac{N}{1 - d'^2} = \frac{95}{1 - (.25)^2} = 101.33, \]

disagreeing only in the second decimal place with the result above ($w_{B_1}$)
because this approximation ($w_{B_2}$) assumed $n_1 = n_2 = 47.5$ rather than
$n_1 = 47$ and $n_2 = 48$ as in the result above ($w_{B_1}$).

Now, we can test the difference between our two effect sizes from equation 4.8:

\[ \frac{d'_A - d'_B}{\sqrt{\frac{1}{w_A} + \frac{1}{w_B}}} = \frac{.70 - .25}{\sqrt{\frac{1}{39.22} + \frac{1}{101.32}}} = 2.39 \]

as our obtained Z of the difference. The p associated with that Z is .0084 
one-tailed or .017 two-tailed. This example, example 8, was selected to re¬ 
flect the same underlying effect size as example 7 and example 5. The three 
Z’s found by our three methods agreed very well with one another with Z’s 
of 2.39, 2.32, and 2.31, respectively. 
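
The calculations of Example 8 can be sketched as follows, using the proportions implied by the worked arithmetic (.85 and .15 in Study A; .625 and .375 in Study B). The function names are ours.

```python
import math

def w_exact(n1, p1, n2, p2):
    """Equation 4.6: reciprocal of the estimated variance of d' = p1 - p2."""
    return n1 * n2 / (n2 * p1 * (1 - p1) + n1 * p2 * (1 - p2))

def w_approx(total_n, d_prime):
    """Equation 4.7: approximation to w needing only N and d'."""
    return total_n / (1 - d_prime ** 2)

def compare_d_prime(d_a, w_a, d_b, w_b):
    """Equation 4.8: Z of the difference between two independent d' values."""
    return (d_a - d_b) / math.sqrt(1 / w_a + 1 / w_b)

w_a1 = w_exact(10, 0.85, 10, 0.15)     # Study A, equation 4.6
w_b1 = w_exact(47, 0.625, 48, 0.375)   # Study B, equation 4.6
w_a2 = w_approx(20, 0.70)              # Study A, equation 4.7
w_b2 = w_approx(95, 0.25)              # Study B, equation 4.7
z_d = compare_d_prime(0.70, w_a1, 0.25, w_b1)
print(round(w_a1, 2), round(w_b1, 2), round(w_a2, 2), round(w_b2, 2), round(z_d, 2))
```
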

II. B. Combining Studies 

II.B.1. Significance testing. After comparing the results of any two independent
studies, it is an easy matter to combine the p levels of the two
studies. Thus we get an overall estimate of the probability that the two p 
levels might have been obtained if the null hypothesis of no relation be¬ 
tween X and Y were true. Many methods for combining the results of two or 
more studies are available; they will be described later and have been sum¬ 
marized elsewhere (Rosenthal, 1978, 1980). Here it is necessary to give 
only the simplest and most versatile of the procedures, the method of adding
Z’s called the Stouffer method by Mosteller and Bush (1954). This
method, like the method of comparing p values, asks us first to obtain accurate
p levels for each of our two studies and then to find the Z corresponding
to each of these p levels. Both p’s must be given in one-tailed form and the
corresponding Z’s will have the same sign if both studies show effects in the
same direction. They will have different signs if the results are in the opposite
direction. The sum of the two Z’s when divided by √2 yields a new Z.
This new Z corresponds to the p value that the results of the two studies
combined (or results even further out in the same tail) could have occurred
if the null hypothesis of no relationship between X and Y were true. Recapping,

\[ \frac{Z_1 + Z_2}{\sqrt{2}} \ \text{is distributed as } Z \qquad [4.9] \]

We could weight each Z by its df, its estimated quality, or any other desired 
weights (Mosteller & Bush, 1954; Rosenthal, 1978, 1980). 

The general procedure for weighting Z’s is to multiply each Z by any 
desired weight (assigned before inspection of the data), add the weighted 
Z’s and divide the sum of the weighted Z’s by the square root of the sum of 
the squared weights as follows: 


\[ \text{Weighted } Z = \frac{w_1 Z_1 + w_2 Z_2}{\sqrt{w_1^2 + w_2^2}} \qquad [4.10] \]

Example 11 will illustrate the application of this procedure. 

Example 9. Studies A and B yield results in opposite directions and both
are significant. One p is .05, one-tailed, the other is .0000001, one-tailed
but in the opposite tail. The Z’s corresponding to these p’s are found in a
table of normal deviates to be -1.64 and 5.20, respectively. (Note the
opposite signs to indicate results in opposite directions.) Then from
equation 4.9 we have

$$\frac{Z_1 + Z_2}{\sqrt{2}} = \frac{(-1.64) + (5.20)}{1.41} = 2.52$$

as the Z of the combined results of Studies A and B. The p value associated
with a Z of 2.52 is .006 one-tailed or .012 two-tailed. Thus the combined p
supports the result of the more significant of the two results. If these were
actual results we would want to be very cautious in interpreting our combined
p both because the two p’s were significant in opposite directions and
because the two p’s were so significantly different from each other. We
would try to discover what differences between Studies A and B might have
led to results so different.

Example 10. Studies A and B yield results in the same direction but neither
is significant. One p is .11, the other is .09, and their associated Z’s are
1.23 and 1.34, respectively. From equation 4.9 we have

$$\frac{(1.23) + (1.34)}{1.41} = 1.82$$

as our combined Z. The p associated with that Z is .034 one-tailed or .068
two-tailed.

Example 11. Studies A and B are those of example 9 but now we have
found from a panel of experts that Study A earns a weight ($w_1$) of 3.4 on
assessed internal validity while Study B earns only a weight ($w_2$) of 0.9.
The Z’s for Studies A and B had been -1.64 and 5.20, respectively. Therefore,
employing equation 4.10 we find

$$\frac{(3.4)(-1.64) + (0.9)(5.20)}{\sqrt{(3.4)^2 + (0.9)^2}} = \frac{-0.896}{3.517} = -0.25$$

as the Z of the combined results of Studies A and B. The p value associated
with this Z is .40 one-tailed or .80 two-tailed. Note that weighting has led to
a nonsignificant result in this example. In example 9 where there was no
weighting (or, more accurately, equal weighting with $w_1 = w_2 = 1$), the p
value was significant at p = .012 two-tailed.

If the weighting had been by df rather than research quality, and if df for
Studies A and B had been 36 and 144, respectively, the weighted Z would
have been

$$\frac{(36)(-1.64) + (144)(5.20)}{\sqrt{(36)^2 + (144)^2}} = \frac{689.76}{148.43} = 4.65$$

This result shows the combined Z (p < .000002 one-tailed) to have been
moved strongly in the direction of the Z with the larger df because of the
substantial difference in df between the two studies. Note that when
weighting Z’s by df we have decided to have the size of the study play a very
large role in determining the combined p. The role is very large because the
size of the study has already entered into the determination of each Z and is
therefore entering a second time into the weighting process.
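The method of adding Z’s (equations 4.9 and 4.10) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `combine_two_z` and `one_tailed_p` are hypothetical, and the Z’s are taken as given rather than looked up from p levels.

```python
import math
from statistics import NormalDist


def combine_two_z(z1, z2, w1=1.0, w2=1.0):
    """Weighted combination of two Z's (equation 4.10).

    With w1 == w2 this reduces to (z1 + z2) / sqrt(2), i.e. equation 4.9.
    """
    return (w1 * z1 + w2 * z2) / math.sqrt(w1 ** 2 + w2 ** 2)


def one_tailed_p(z):
    """Upper-tail probability of a standard normal deviate."""
    return 1.0 - NormalDist().cdf(z)


# Example 9: equal weighting of Z = -1.64 and Z = 5.20.
z_equal = combine_two_z(-1.64, 5.20)
# Example 11: weights from rated quality, then weights from df.
z_quality = combine_two_z(-1.64, 5.20, w1=3.4, w2=0.9)
z_df = combine_two_z(-1.64, 5.20, w1=36, w2=144)

print(round(z_equal, 2), round(one_tailed_p(z_equal), 3))
print(round(z_quality, 2), round(z_df, 2))
```

The three calls correspond to the equal, quality, and df weightings of examples 9 and 11, so the effect of the choice of weights on the combined Z can be compared directly.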

II.B.2. Effect size estimation. When we want to combine the results of
two studies, we are at least as interested in the combined estimate of the
effect size as we are in the combined probability. Just as was the case when
we compared two effect size estimates, we shall consider r as our primary
effect size estimate in the combining of effect sizes. However, many other
estimates are possible (e.g., Cohen’s d, Hedges’s g, or Glass’s Δ, or
differences between proportions, d').

For each of the two studies to be combined, we compute r and the associated
Fisher $z_r$ and have

$$\frac{z_{r_1} + z_{r_2}}{2} = \bar{z}_r \qquad [4.11]$$

as the Fisher $z_r$ corresponding to our mean r. We use an r to $z_r$ or $z_r$
to r table to look up the r associated with our mean $z_r$. Tables are handier
than computing r from $z_r$ from the following:
$r = (e^{2z_r} - 1)/(e^{2z_r} + 1)$, where $e \approx 2.71828$, the base of the
system of natural logarithms. Should we want to do so we could weight each
$z_r$ by its df, i.e., N - 3 (Snedecor & Cochran, 1967, 1980), by its
estimated research quality, or by any other weights assigned before
inspection of the data.

The weighted mean z r is obtained as follows: 


$$\text{weighted mean } z_r = \frac{w_1 z_{r_1} + w_2 z_{r_2}}{w_1 + w_2} \qquad [4.12]$$


Example 14 will illustrate the application of this procedure. 

Example 12. Studies A and B yield results in opposite directions, one r = 
.80, the other r = -.30. The Fisher z r ’s corresponding to these r’s are 1.10 
and -0.31, respectively. From equation (4.11) we have 


$$\frac{z_{r_1} + z_{r_2}}{2} = \frac{(1.10) + (-0.31)}{2} = .395$$


as the mean Fisher z r . From our z r to r table we find a z r of .395 associated 
with an r of .38. 

Example 13. Studies A and B yield results in the same direction, one r =
.95, the other r = .25. The Fisher $z_r$’s corresponding to these r’s are 1.83
and .26, respectively. From equation (4.11) we have

$$\frac{1.83 + .26}{2} = 1.045$$

as the mean Fisher $z_r$. From our $z_r$ to r table we find a $z_r$ of 1.045
to be associated with an r of .78. Note that if we had averaged the two r’s
without first transforming them to Fisher $z_r$’s we would have found the mean
r to be (.95 + .25)/2 = .60, substantially smaller than .78. This illustrates
that the use of Fisher’s $z_r$ gives heavier weight to r’s that are further
from zero in either direction.

Example 14. Studies A and B are those of example 6 but now we have
decided to weight the studies by their df (i.e., N - 3 in this application).
Therefore, equation 4.12 can be rewritten to indicate that we are using df as
weights as follows:

$$\text{weighted } z_r = \frac{df_1 z_{r_1} + df_2 z_{r_2}}{df_1 + df_2}$$

In example 6 we had r’s of .00 and .30 based on N’s of 17 and 45,
respectively. The Fisher $z_r$’s corresponding to our two r’s are .00 and .31.
Therefore, we find our weighted $z_r$ to be

$$\frac{(17 - 3).00 + (45 - 3).31}{(17 - 3) + (45 - 3)} = \frac{13.02}{56} = .232$$

which corresponds to an r of .23.
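The averaging of r’s through Fisher’s $z_r$ (equations 4.11 and 4.12) can also be done without tables. The sketch below is illustrative only; `mean_r` is a hypothetical helper name, and it relies on the fact that the r to $z_r$ transformation is the inverse hyperbolic tangent, so that `math.tanh` is the same back-transformation as the exponential formula given above.

```python
import math


def mean_r(rs, weights=None):
    """Average correlations through Fisher's z_r (equations 4.11 and 4.12).

    Returns (mean z_r, corresponding r). Equal weights are used when
    none are supplied.
    """
    if weights is None:
        weights = [1.0] * len(rs)
    zs = [math.atanh(r) for r in rs]                      # r -> z_r
    mean_z = sum(w * z for w, z in zip(weights, zs)) / sum(weights)
    return mean_z, math.tanh(mean_z)                      # z_r -> r


# Example 13: unweighted mean of r = .95 and r = .25.
z13, r13 = mean_r([0.95, 0.25])
# Example 14: r = .00 (N = 17) and r = .30 (N = 45), weighted by N - 3.
z14, r14 = mean_r([0.00, 0.30], weights=[17 - 3, 45 - 3])

print(round(r13, 2), round(r14, 2))
```

Because the code keeps unrounded $z_r$ values, intermediate figures can differ from the tabled ones in the third decimal place.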

Finally, it should be noted that before combining tests of significance
and/or effect size estimates, it is very useful first to test the significance of
the difference between the two p values or, what is preferable if they are
available, the two effect sizes. If the results of the studies do differ we
should be most cautious about combining their p values or effect sizes,
especially when their results are in opposite directions.

II.B.2.a. Other effect size estimates. All that has been said about the
combining of r’s applies in principle also to the combining of other effect
size estimates. Thus we can average Hedges’s g, or Cohen’s d, or Glass’s Δ, or
the difference between proportions, d', or any other effect size estimate,
with or without weighting. The difference in practice is that when we combine
r’s we typically transform them to Fisher’s $z_r$’s before combining, while
with most other effect size estimates we do not transform them before
combining them.


III. META-ANALYTIC PROCEDURES:

ANY NUMBER OF INDEPENDENT STUDIES

Although we can do quite a lot in the way of comparing and combining
the results of sets of studies with the procedures given so far, it often
happens that we have three or more studies of the same relationship that we
want to compare and/or combine. The purpose of this section is to present
generalizations of the procedures given in the last section so that we can
compare and combine the results of any number of independent studies.
Again, the examples are hypothetical, constructed to illustrate a wide range
of situations occurring in meta-analytic work in any domain. Often, of
course, the number of studies entering into our analyses will be larger than
the number required to illustrate the various meta-analytic procedures.

III.A. Comparing Studies: Diffuse Tests

III.A.1. Significance testing. Given three or more p levels to compare we
first find the standard normal deviate, Z, corresponding to each p level. All
p levels must be one-tailed and the corresponding Z’s will have the same
sign if all studies show effects in the same direction, but different signs if the
results are not all in the same direction. The statistical significance of the
heterogeneity of the Z’s can be obtained from a $\chi^2$ computed as follows
(Rosenthal & Rubin, 1979a):

$$\sum (Z_j - \bar{Z})^2 \ \text{is distributed as}\ \chi^2\ \text{with}\ K - 1\ df \qquad [4.14]$$

In this equation $Z_j$ is the Z for any one study, $\bar{Z}$ is the mean of
all the Z’s obtained, and K is the number of studies being combined.

Example 15. Studies A, B, C, and D yield one-tailed p values of .15, .05, 
.01, and .001, respectively. Study C, however, shows results opposite in 
direction from those of studies A, B, and D. From a normal table we find the 
Z’s corresponding to the four p levels to be 1.04, 1.64, -2.33, and 3.09. 
(Note the negative sign for the Z associated with the result in the opposite 
direction.) Then, from the preceding equation 4.14 we have 

$$\sum (Z_j - \bar{Z})^2 = [(1.04) - (0.86)]^2 + [(1.64) - (0.86)]^2 + [(-2.33) - (0.86)]^2 + [(3.09) - (0.86)]^2 = 15.79$$

as our $\chi^2$ value, which for K - 1 = 4 - 1 = 3 df is significant at
p = .0013. The four p values we compared, then, are clearly significantly
heterogeneous.
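The diffuse test of equation 4.14 can be sketched as follows. This is an illustration rather than a prescribed implementation: `chi2_sf` and `heterogeneity_of_z` are hypothetical names, and the p value comes from a closed-form expression for the upper tail of $\chi^2$ that holds for integer df, used here only to avoid a table lookup.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of chi-square for integer df (closed form)."""
    half = x / 2.0
    if df % 2 == 0:
        terms = sum(half ** i / math.factorial(i) for i in range(df // 2))
        return math.exp(-half) * terms
    tail = math.erfc(math.sqrt(half))
    extra = sum(half ** (i - 0.5) / math.gamma(i + 0.5)
                for i in range(1, (df - 1) // 2 + 1))
    return tail + math.exp(-half) * extra


def heterogeneity_of_z(zs):
    """Equation 4.14: sum of squared deviations of the Z's from their mean."""
    mean_z = sum(zs) / len(zs)
    chi2 = sum((z - mean_z) ** 2 for z in zs)
    df = len(zs) - 1
    return chi2, df, chi2_sf(chi2, df)


# Example 15: Study C is in the opposite direction, hence the negative Z.
chi2, df, p = heterogeneity_of_z([1.04, 1.64, -2.33, 3.09])
print(round(chi2, 2), df, round(p, 4))
```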

III.A.2. Effect size estimation. Here we want to assess the statistical
heterogeneity of three or more effect size estimates. We again emphasize r as
the effect size estimator, but analogous procedures are available for
comparing such other effect size estimators as Hedges’s (1981) g or
differences between proportions (Hedges, 1982b; Hsu, 1980; Rosenthal & Rubin,
1982a). These will be described and illustrated shortly.

For each of the three or more studies to be compared we compute the
effect size r, its associated Fisher $z_r$, and N - 3, where N is the number of
sampling units on which each r is based. Then the statistical significance of
the heterogeneity of the r’s can be obtained from a $\chi^2$ (Snedecor &
Cochran, 1967, 1980, 1989) because

$$\sum (N_j - 3)(z_{r_j} - \bar{z}_r)^2 \ \text{is distributed as}\ \chi^2\ \text{with}\ K - 1\ df \qquad [4.15]$$

In this equation $z_{r_j}$ is the Fisher $z_r$ corresponding to any r, and
$\bar{z}_r$ is the weighted mean $z_r$, i.e.,

$$\bar{z}_r = \frac{\sum (N_j - 3) z_{r_j}}{\sum (N_j - 3)} \qquad [4.16]$$

Example 16. Studies A, B, C, and D yield effect sizes of r = .70 (N = 30),
r = .45 (N = 45), r = .10 (N = 20), and r = -.15 (N = 25), respectively. The
Fisher $z_r$’s corresponding to these r’s are found from tables of Fisher
$z_r$ to be .87, .48, .10, and -.15, respectively. The weighted mean $z_r$ is
found from the equation just above (4.16) to be

$$\frac{27(.87) + 42(.48) + 17(.10) + 22(-.15)}{27 + 42 + 17 + 22} = \frac{42.05}{108} = .39$$

Then from the equation for $\chi^2$ above (equation 4.15) we have

$$\sum (N_j - 3)(z_{r_j} - \bar{z}_r)^2 = 27(.87 - .39)^2 + 42(.48 - .39)^2 + 17(.10 - .39)^2 + 22(-.15 - .39)^2 = 14.41$$

as our $\chi^2$ value, which for K - 1 = 3 df is significant at p = .0024.
The four effect sizes we compared, then, are clearly significantly
heterogeneous.
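The computation of example 16 (equations 4.15 and 4.16) can be sketched in Python. The function name `heterogeneity_of_r` is hypothetical, and the sketch transforms each r exactly rather than reading $z_r$ from a table, so its $\chi^2$ differs slightly from the hand computation; the p value would still be read from a $\chi^2$ table with K - 1 df.

```python
import math


def heterogeneity_of_r(rs, ns):
    """Chi-square test of heterogeneity for correlations (equation 4.15).

    Each r is transformed to Fisher's z_r and weighted by N - 3.
    Returns (weighted mean z_r, chi-square, df).
    """
    zs = [math.atanh(r) for r in rs]
    ws = [n - 3 for n in ns]
    mean_z = sum(w * z for w, z in zip(ws, zs)) / sum(ws)    # equation 4.16
    chi2 = sum(w * (z - mean_z) ** 2 for w, z in zip(ws, zs))
    return mean_z, chi2, len(rs) - 1


mean_z, chi2, df = heterogeneity_of_r([0.70, 0.45, 0.10, -0.15],
                                      [30, 45, 20, 25])
print(round(mean_z, 2), round(chi2, 2), df)
```

With unrounded $z_r$ values the statistic comes out near 14.40 rather than 14.41; the difference reflects only the two-decimal rounding of the tabled $z_r$’s.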

III.A.2.a. Other effect size estimates. Although r is our preferred effect
size estimate in this chapter, analogous procedures are available for such
other effect size estimates as $(M_1 - M_2)/S$ (Hedges’s g) or the difference
between proportions (d'). We begin with the case of Hedges’s g.

For each of the studies in the set we compute Hedges’s g [$(M_1 - M_2)/S$]
and the reciprocal (w) of the estimated variance of g (1/w). We saw in
equation 4.3 how to compute w (Rosenthal & Rubin, 1982a):

$$w = \frac{2(n_1 n_2)(n_1 + n_2 - 2)}{(n_1 + n_2)[t^2 + 2(n_1 + n_2 - 2)]} \qquad [4.3]$$

Once we have w we can test the heterogeneity of the set of g’s because
Hedges (1982b) and Rosenthal and Rubin (1982a) have shown that

$$\sum w_j (g_j - \bar{g})^2 \ \text{is distributed approximately as}\ \chi^2\ \text{with}\ K - 1\ df \qquad [4.17]$$

The quantity $\bar{g}$ is the weighted mean g defined as

$$\bar{g} = \frac{\sum w_j g_j}{\sum w_j} \qquad [4.18]$$

Note the similarity in structure between equations 4.17 and 4.15 and
between 4.18 and 4.16. Equation 4.17 will be an adequate approximation in
most circumstances but it will lose some accuracy when sample sizes are
very small and t statistics are large.

Example 17. Studies A, B, C, and D yield effect sizes of g = 1.89 (N = 30),
g = .99 (N = 45), g = .19 (N = 20), and g = -.29 (N = 25), respectively.
To employ equations 4.17 and 4.18 we will need to compute w for each effect
size. Equation 4.3 showing how to compute w requires knowing the sample
sizes of the two groups being compared in each study ($n_1$ and $n_2$) as well
as the results of the t test. If the t tests were not available we could
compute our own from equations 2.4, 2.5, 2.25, or 2.26, for example:

$$t = g \times \sqrt{\frac{n_1 n_2}{n_1 + n_2}} \qquad [4.19]$$

If the $n_1$ and $n_2$ values are not reported but N (i.e., $n_1 + n_2$) is
known and if it is reasonable to assume approximately equal sample sizes, we
can replace $n_1$ and $n_2$ by N/2. In that case equation 4.19 simplifies to

$$t = g \times \frac{\sqrt{N}}{2} \qquad [4.20]$$

and equation 4.3 simplifies to

$$w = \frac{N(N - 2)}{2(t^2 + 2N - 4)} \qquad [4.21]$$

Since in the present example we were not given $n_1$, $n_2$, or t for Studies
A, B, C, and D, we employ equation 4.20 to obtain t and equation 4.21 to
obtain w for each study. Table 4.3 shows the results of these computations,
which are shown in detail only for Study A, for which N = 30 and g = 1.89.
From equation 4.20 we find

$$t = g \times \frac{\sqrt{N}}{2} = 1.89 \times \frac{\sqrt{30}}{2} = 5.18$$

TABLE 4.3

Work Table for Comparing Four Effect Sizes (g)

a. Obtainable from: $g = 2t/\sqrt{N}$ (from equation 4.20).
b. Obtainable from: $t = g\sqrt{N}/2$ (equation 4.20).
c. Obtainable from: $w = N(N - 2)/[2(t^2 + 2N - 4)]$ (equation 4.21).

From equation 4.21 we find:

$$w = \frac{N(N - 2)}{2(t^2 + 2N - 4)} = \frac{30(28)}{2[(5.18)^2 + 2(30) - 4]} = 5.07$$

Before we can employ equation 4.17, our $\chi^2$ test for heterogeneity, we
must find $\bar{g}$, the weighted mean g (see equation 4.18), which can be
found from the appropriate entries in the row of sums of Table 4.3:

$$\bar{g} = \frac{\sum w_j g_j}{\sum w_j} = .71$$

Now we can employ equation 4.17 to compute $\chi^2$:

$$\sum w_j (g_j - \bar{g})^2 = 5.07(1.89 - .71)^2 + 9.97(.99 - .71)^2 + 4.98(.19 - .71)^2 + 6.18(-.29 - .71)^2 = 15.37$$

a $\chi^2$ value which, for K - 1 = 3 df, is significant at p = .0015. The
four effect sizes we compared, then, are clearly significantly heterogeneous.

The four effect sizes of this example were chosen to be the equivalents in
units of g to the effect sizes of example 16, which were in units of r. The
$\chi^2(3)$ based on g was somewhat larger (by 7%) than the $\chi^2(3)$ based
on r, and the p of .0015 is slightly more significant than that for example 16
(.0024). The agreement is close enough for practical purposes but we should
not expect perfect agreement. Incidentally, if we have available a set of r’s
and want to convert them to g’s, still assuming approximately equal sample
sizes within each condition, we can simplify the conversion equation 2.27 to
the following:

$$g = \frac{r}{\sqrt{1 - r^2}} \times 2\sqrt{\frac{N - 2}{N}} \qquad [4.22]$$

Should we want to convert g’s to r’s we can analogously simplify the
conversion equation 2.28 to the following:

$$r = \sqrt{\frac{g^2 N}{g^2 N + 4(N - 2)}} \qquad [4.23]$$

If our effect size estimate were the difference between proportions (d'),
our procedure would be analogous to that when our effect size estimate was
Hedges’s g. For each of the studies in the set we compute d' and the
reciprocal (w) of the estimated variance of d' (1/w). The basic estimate of w
is provided by equation 4.6, which works well unless $n_1$ or $n_2$ is very
small and $p_1$ or $p_2$ is very close to zero or one. If $n_1$ or $n_2$ is
very small, a conservative procedure is to replace p(1 - p) by its maximal
possible value of .25. We give equation 4.6 again:

$$w = \frac{n_1 n_2}{n_2 p_1 (1 - p_1) + n_1 p_2 (1 - p_2)} \qquad [4.6]$$

The approximation to this expression that depends only on the total study
size (N) and the effect size estimate d' was given earlier as equation 4.7:

$$w = \frac{N}{1 - d'^2} \qquad [4.7]$$

This approximation to equation 4.6 holds exactly when $p_1$ and $p_2$ are the
same amount above and below .5 and when $n_1 = n_2$.

Once we have w we can test the heterogeneity of the set of d'’s by means
of equation 4.17 (Rosenthal & Rubin, 1982a) but substituting d' for g:

$$\sum w_j (d'_j - \bar{d}')^2 \ \text{is distributed approximately as}\ \chi^2\ \text{with}\ K - 1\ df \qquad [4.24]$$

The quantity $\bar{d}'$ is the weighted mean d' defined as:

$$\bar{d}' = \frac{\sum w_j d'_j}{\sum w_j} \qquad [4.25]$$

a quantity defined analogously to $\bar{g}$ (see equation 4.18).
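For Hedges’s g, the steps of example 17 (t from equation 4.20, w from equation 4.21, then the weighted mean and $\chi^2$ of equations 4.18 and 4.17) can be sketched as below. All function names are hypothetical, approximately equal group sizes are assumed throughout, and giving the converted r the sign of g in `r_from_g` is an assumption added here, since equation 4.23 as written returns only the positive root.

```python
import math


def t_from_g(g, n_total):
    """Equation 4.20: t from g, assuming two groups of equal size."""
    return g * math.sqrt(n_total) / 2.0


def w_for_g(g, n_total):
    """Equation 4.21: reciprocal of the estimated variance of g."""
    t = t_from_g(g, n_total)
    return n_total * (n_total - 2) / (2.0 * (t ** 2 + 2 * n_total - 4))


def r_from_g(g, n_total):
    """Equation 4.23, with the sign of g carried over (an assumption)."""
    gn = g * g * n_total
    return math.copysign(math.sqrt(gn / (gn + 4 * (n_total - 2))), g)


def weighted_heterogeneity(effects, ws):
    """Equations 4.18 and 4.17: weighted mean effect size and chi-square."""
    mean = sum(w * e for w, e in zip(ws, effects)) / sum(ws)
    chi2 = sum(w * (e - mean) ** 2 for w, e in zip(ws, effects))
    return mean, chi2, len(effects) - 1


gs = [1.89, 0.99, 0.19, -0.29]
ns = [30, 45, 20, 25]
ws = [w_for_g(g, n) for g, n in zip(gs, ns)]
mean_g, chi2, df = weighted_heterogeneity(gs, ws)
print([round(w, 2) for w in ws], round(mean_g, 2), round(chi2, 2))
print([round(r_from_g(g, n), 2) for g, n in zip(gs, ns)])
```

The second line of output converts the four g’s back to r’s, which is a quick check that they are the equivalents of the r’s used in example 16.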






TABLE 4.4

Work Table for Comparing Four Effect Sizes (d')

Study      N      d'      d'^2    1 - d'^2     w^a        wd'
A         30     .70     .4900     .5100      58.82     41.174
B         45     .45     .2025     .7975      56.43     25.394
C         20     .10     .0100     .9900      20.20      2.020
D         25    -.15     .0225     .9775      25.58     -3.837
Σ        120    1.10     .7250    3.2750     161.03     64.751

a. Obtainable from: $w = N/(1 - d'^2)$ (equation 4.7).




Example 18. Studies A, B, C, and D yield effect sizes of d' = .70, .45,
.10, and -.15, respectively. Table 4.4 shows the results of the computations
of w for each of the studies. To illustrate these computations for Study A we
employ equation 4.7 as follows:

$$w = \frac{N}{1 - d'^2} = \frac{30}{1 - (.70)^2} = 58.82$$

Before we can employ equation 4.24, our $\chi^2$ test for heterogeneity, we
must find $\bar{d}'$, the weighted mean d' (equation 4.25), which can be found
from the appropriate entries in the row of sums of Table 4.4:

$$\bar{d}' = \frac{\sum w_j d'_j}{\sum w_j} = \frac{64.751}{161.03} = .40$$

Then, employing equation 4.24 we find:

$$\sum w_j (d'_j - \bar{d}')^2 = 58.82(.70 - .40)^2 + 56.43(.45 - .40)^2 + 20.20(.10 - .40)^2 + 25.58(-.15 - .40)^2 = 14.99$$

a $\chi^2$ value which, for K - 1 = 3 df, is significant at p = .0018. The
four effect sizes are significantly heterogeneous.

The four effect sizes of this example were chosen to be the equivalents in
units of d' to the effect sizes of example 16 (r) and example 17 (g). Table 4.5
summarizes the data for the three effect size estimates of examples 16, 17,
and 18. While the three $\chi^2(3)$ values are not identical, they are quite
similar to one another as are the three significance levels. Table 4.5 also
suggests that the metric r is quite similar to the metric d'. Indeed, we shall
see in the final chapter of this book that when the proportions being compared
are the same amount above and below .5 and when $n_1 = n_2$, r computed from
such a 2 × 2 table does indeed equal d'.

TABLE 4.5

Tests for the Heterogeneity of Effect Sizes Defined as r, g, and d'

(Rows: effect sizes by study, median, unweighted mean, weighted mean.)
a. Based on Fisher’s $z_r$ transformation.
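The heterogeneity test for differences between proportions (example 18) follows the same pattern, with w taken from the approximation of equation 4.7. The sketch below is illustrative; the function names are hypothetical and the N’s are those listed for the four studies in Table 4.4.

```python
def w_for_d_prime(d, n_total):
    """Equation 4.7: approximate reciprocal of the variance of d'."""
    return n_total / (1.0 - d ** 2)


def heterogeneity_of_d_prime(ds, ns):
    """Equations 4.25 and 4.24: weighted mean d' and chi-square."""
    ws = [w_for_d_prime(d, n) for d, n in zip(ds, ns)]
    mean_d = sum(w * d for w, d in zip(ws, ds)) / sum(ws)
    chi2 = sum(w * (d - mean_d) ** 2 for w, d in zip(ws, ds))
    return mean_d, chi2, len(ds) - 1


mean_d, chi2, df = heterogeneity_of_d_prime([0.70, 0.45, 0.10, -0.15],
                                            [30, 45, 20, 25])
print(round(mean_d, 2), round(chi2, 2), df)
```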

III.B. Comparing Studies: Focused Tests

III.B.1. Significance testing. Although we know how to answer the diffuse
question of the significance of the differences among a collection of
significance levels, we are often able to ask a more focused and more useful
question. For example, given a set of p levels for studies of teacher
expectancy effects, we might want to know whether results from younger
children show greater degrees of statistical significance than do results from
older children (Rosenthal & Rubin, 1978). Normally our greater interest would
be in the relation between our weights derived from theory and our obtained
effect sizes. Sometimes, however, the effect size estimates, along with their
sample sizes, are not available. More rarely, we may be intrinsically
interested in the relation between our weights and the obtained levels of
significance.

As was the case for diffuse tests, we begin by finding the standard normal
deviate, Z, corresponding to each p level. All p levels must be one-tailed,
and the corresponding Z’s will have the same sign if all studies show effects
in the same direction. The statistical significance of the contrast testing any
specific hypothesis about the set of p levels can be obtained from a Z
computed as follows (Rosenthal & Rubin, 1979a):

$$\frac{\sum \lambda_j Z_j}{\sqrt{\sum \lambda_j^2}} \ \text{is distributed as}\ Z$$

In this equation $\lambda_j$ is the theoretically derived prediction or
contrast weight for any one study, chosen such that the sum of the
$\lambda_j$’s will be zero, and $Z_j$ is the Z for any one study.

Example 19. Studies A, B, C, and D yield one-tailed p values of 1/10^7,
.0001, .21, and .007, respectively, all with results in the same direction.
From a normal table or from a calculator with a built-in Z distribution we
find the Z’s corresponding to the four p levels to be 5.20, 3.72, .81, and
2.45. Suppose that Studies A, B, C, and D had involved differing amounts of
peer tutor contact such that Studies A, B, C, and D had involved 8, 6, 4, and 2
hours of contact per month, respectively. We might, therefore, ask whether
there was a linear relationship between number of hours of contact and
statistical significance of the result favoring peer tutoring. The weights of a
linear contrast involving four studies are 3, 1, -1, and -3. (These are
obtained from a table of orthogonal polynomials; see, for example, Rosenthal &
Rosnow, 1984a, 1991.) Therefore, from the preceding equation we have

$$\frac{\sum \lambda_j Z_j}{\sqrt{\sum \lambda_j^2}} = \frac{(3)5.20 + (1)3.72 + (-1).81 + (-3)2.45}{\sqrt{(3)^2 + (1)^2 + (-1)^2 + (-3)^2}} = \frac{11.16}{\sqrt{20}} = 2.50$$

as our Z value, which is significant at p = .006, one-tailed. The four p
values, then, tend to grow linearly more significant as the number of hours
of contact time increases.
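The focused test on a set of Z’s used in example 19 can be sketched as follows. The name `contrast_z` is hypothetical, and the check that the weights sum to zero is a convenience added here, not a step named in the text.

```python
import math
from statistics import NormalDist


def contrast_z(zs, lambdas):
    """Focused test on Z's: sum(lambda * Z) / sqrt(sum(lambda ** 2))."""
    if abs(sum(lambdas)) > 1e-9:
        raise ValueError("contrast weights must sum to zero")
    num = sum(lam * z for lam, z in zip(lambdas, zs))
    return num / math.sqrt(sum(lam ** 2 for lam in lambdas))


# Example 19: Z's for 8, 6, 4, and 2 hours of contact; linear weights.
z = contrast_z([5.20, 3.72, 0.81, 2.45], [3, 1, -1, -3])
p = 1.0 - NormalDist().cdf(z)
print(round(z, 2), round(p, 3))
```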

III.B.2. Effect size estimation. Here we want to ask a more focused
question of a set of effect sizes. For example, given a set of effect sizes
for studies of peer tutoring, we might want to know whether these effects are
increasing or decreasing linearly with the number of hours of contact per
month. We again emphasize r as the effect size estimator but analogous
procedures are available for comparing such other effect size estimators as
Hedges’s (1981) g or differences between proportions (d') (Rosenthal & Rubin,
1982a). These will be described and illustrated shortly.

As was the case for diffuse tests, we begin by computing the effect size r,
its associated Fisher $z_r$, and N - 3, where N is the number of sampling units
on which each r is based. The statistical significance of the contrast, testing
any specific hypothesis about the set of effect sizes, can be obtained from a
Z computed as follows (Rosenthal & Rubin, 1982a):

$$\frac{\sum \lambda_j z_{r_j}}{\sqrt{\sum \lambda_j^2 / w_j}} \ \text{is distributed as}\ Z \qquad [4.27]$$

In this equation, $\lambda_j$ is the contrast weight determined from some
theory for any one study, chosen such that the sum of the $\lambda_j$’s will be
zero. The $z_{r_j}$ is the Fisher $z_r$ for any one study and $w_j$ is the
inverse of the variance of the effect size for each study. For Fisher $z_r$
transformations of the effect size r, the variance is $1/(N_j - 3)$ so
$w_j = N_j - 3$.

Example 20. Studies A, B, C, and D yield effect sizes of r = .89, .76, .23,
and .59, respectively, all with N = 12. The Fisher $z_r$’s corresponding to
these r’s are found from tables of Fisher $z_r$ to be 1.42, 1.00, .23, and .68,
respectively. Suppose that Studies A, B, C, and D had involved differing
amounts of peer tutor contact such that Studies A, B, C, and D had involved
8, 6, 4, and 2 hours of contact per month, respectively. We might, therefore,
ask whether there was a linear relationship between number of hours of
contact and size of effect favoring peer tutoring. As in example 19, the
appropriate weights, or $\lambda$’s, are 3, 1, -1, and -3. Therefore, from the
preceding equation we have

$$\frac{\sum \lambda_j z_{r_j}}{\sqrt{\sum \lambda_j^2 / w_j}} = \frac{(3)1.42 + (1)1.00 + (-1).23 + (-3).68}{\sqrt{\frac{(3)^2}{9} + \frac{(1)^2}{9} + \frac{(-1)^2}{9} + \frac{(-3)^2}{9}}} = \frac{2.99}{\sqrt{2.222}} = 2.01$$

as our Z value, which is significant at p = .022 one-tailed. The four effect
sizes, therefore, tend to grow linearly larger as the number of hours of
contact time increases. Interpretation of this relation must be very cautious.
After all, studies were not assigned at random to the four conditions of
contact hours. Generally, variables moderating the magnitude of effects
found should not be interpreted as giving strong evidence for any causal
relationships. Moderator relationships can, however, be very valuable in
suggesting the possibility of causal relationships, possibilities that can then
be studied experimentally or as nearly experimentally as possible.
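The contrast of equation 4.27 can be sketched for the r’s of example 20. The function name `contrast_effect_sizes` is hypothetical; the sketch computes $z_r$ exactly rather than from a table, so intermediate values differ slightly from the rounded ones used by hand.

```python
import math


def contrast_effect_sizes(effects, ws, lambdas):
    """Equation 4.27: sum(lambda * ES) / sqrt(sum(lambda ** 2 / w))."""
    num = sum(lam * e for lam, e in zip(lambdas, effects))
    den = math.sqrt(sum(lam ** 2 / w for lam, w in zip(lambdas, ws)))
    return num / den


rs = [0.89, 0.76, 0.23, 0.59]
ns = [12, 12, 12, 12]
z_rs = [math.atanh(r) for r in rs]     # Fisher z_r for each study
ws = [n - 3 for n in ns]               # w = N - 3 for Fisher z_r
z = contrast_effect_sizes(z_rs, ws, [3, 1, -1, -3])
print(round(z, 2))
```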

III.B.2.a. Other effect size estimates. Although r is our preferred effect
size estimate in this chapter, analogous procedures are available for such
other effect size estimates as $(M_1 - M_2)/S$ (Hedges’s g) or the difference
between proportions (d'). We begin with the case of Hedges’s g.

Once again we compute the reciprocal (w) of the estimated variance of g
(1/w) for each study. We employ equation 4.3 when the individual sample
sizes ($n_1$ and $n_2$) are known and unequal and equation 4.21 when they are
unknown or when they are equal. These equations are as follows:

$$w = \frac{2(n_1 n_2)(n_1 + n_2 - 2)}{(n_1 + n_2)[t^2 + 2(n_1 + n_2 - 2)]} \qquad [4.3]$$

$$w = \frac{N(N - 2)}{2(t^2 + 2N - 4)} \qquad [4.21]$$

We employ the computed w’s to test the significance of any contrast we
may wish to investigate. The quantity:

$$\frac{\sum \lambda_j g_j}{\sqrt{\sum \lambda_j^2 / w_j}} \ \text{is distributed approximately as}\ Z \qquad [4.28]$$

an equation that is identical in structure to equation 4.27 (Rosenthal &
Rubin, 1982a). In this application $w_j$ is defined as in equations 4.3 or 4.21
and $\lambda_j$ is the contrast weight we assign to the jth study on the basis
of our theory. The only restriction is that the sum of the $\lambda_j$’s must
be zero (Rosenthal & Rosnow, 1984a, 1985, 1991).

Example 21. Studies A, B, C, and D yield effect sizes of g = 3.56, 2.13,
.43, and 1.33, respectively, all with N = 12. As in example 20, we assume
8, 6, 4, and 2 hours of peer tutoring per month were employed in Studies A,
B, C, and D, respectively. We ask whether there was a linear relationship
between number of hours of contact and size of effect favoring peer tutoring.
As in example 20, the appropriate weights, or $\lambda$’s, are 3, 1, -1, and -3.

Table 4.6 lists the ingredients required to compute our test of significance
(Z) for the contrast and reminds us of the formulas that can be used to
obtain the various quantities. Now we can apply equation 4.28 to find

$$\frac{\sum \lambda_j g_j}{\sqrt{\sum \lambda_j^2 / w_j}} = \frac{8.39}{\sqrt{13.44}} = 2.29$$

as our Z value which is significant at p = .011 one-tailed.

The four effect sizes of this example were chosen to be the equivalents in
units of g to the effect sizes of example 20 which were in units of r. The Z
based on g is somewhat larger (by 14%) than the Z based on r (2.01) and the
p of .011 is somewhat more significant than that for example 20 (p = .022).
The agreement, therefore, is hardly perfect but it is close enough for
practical purposes.

If a meta-analyst has a favorite effect size estimate, he or she need not
fear that a different meta-analyst employing a different effect size estimate
would reach a dramatically different conclusion. However, what should not
be done is to employ a variety of effect size estimates, perform the various
meta-analytic procedures on all of them and report only those results most
pleasing to the meta-analyst. There is nothing wrong with employing multiple
effect size estimates, but all analyses conducted should also be reported.
General and special equations showing the relationships between g and r are
given as equations 2.27, 2.28, 4.22, and 4.23.

If our effect size estimate were the difference between proportions (d'),
our procedure would be analogous to that when our effect size estimate was
Hedges’s g. Once again we compute the reciprocal (w) of the estimated
variance of d' (1/w) for each study. We employ equation 4.6 when the
individual sample sizes $n_1$ and $n_2$ are known and unequal and equation 4.7
when they are unknown or when they are equal. The equations are as follows:

$$w = \frac{n_1 n_2}{n_2 p_1 (1 - p_1) + n_1 p_2 (1 - p_2)} \qquad [4.6]$$

$$w = \frac{N}{1 - d'^2} \qquad [4.7]$$


TABLE 4.6

Work Table for Computing Contrasts Among Effect Sizes (g)

Study     N     g^a      t^b      t^2     λ^c    λ^2      λg     w^d    λ^2/w
A        12    3.56     6.17    38.02      3      9     10.68    1.03    8.74
B        12    2.13     3.69    13.61      1      1      2.13    1.79     .56
C        12     .43      .74      .55     -1      1      -.43    2.92     .34
D        12    1.33     2.30     5.31     -3      9     -3.99    2.37    3.80
Σ        48    7.45    12.90    57.49      0     20      8.39    8.11   13.44

a. Obtainable from: $g = 2t/\sqrt{N}$ (from equation 4.20).
b. Obtainable from: $t = g\sqrt{N}/2$ (equation 4.20).
c. Determined by theory but with $\sum \lambda = 0$.
d. Obtainable from: $w = N(N - 2)/[2(t^2 + 2N - 4)]$ (equation 4.21).


Once we have w we can test any contrast by means of equation 4.28
(Rosenthal & Rubin, 1982a) but substituting d' for g:

$$\frac{\sum \lambda_j d'_j}{\sqrt{\sum \lambda_j^2 / w_j}} \ \text{is distributed approximately as}\ Z \qquad [4.29]$$

In this application $w_j$ is defined as in equations 4.6 or 4.7 and
$\lambda_j$ is as defined above.

Example 22. Studies A, B, C, and D yield effect sizes of d' = .89, .76,
.23, and .59, respectively, all with N = 12. As in example 21, we assume 8,
6, 4, and 2 hours of peer tutoring per month were employed in Studies A, B,
C, and D, respectively. Again we want to test the linear contrast with
$\lambda$’s of 3, 1, -1, and -3. Table 4.7 lists the ingredients required to
compute our test of significance (Z) for the contrast. Now we can apply
equation 4.29 to find:

$$\frac{\sum \lambda_j d'_j}{\sqrt{\sum \lambda_j^2 / w_j}} = \frac{1.43}{\sqrt{\sum \lambda_j^2 / w_j}} = 1.64$$

as our Z value which is significant at p = .05 one-tailed.
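The contrasts of examples 21 and 22 differ only in how w is obtained, which the following sketch makes explicit. The function names are hypothetical, equal group sizes are assumed (equations 4.20, 4.21, and 4.7), and unrounded w’s are used, so intermediate sums differ slightly from the work tables.

```python
import math


def contrast(effects, ws, lambdas):
    """Equations 4.28 and 4.29: sum(lambda * ES) / sqrt(sum(lambda ** 2 / w))."""
    num = sum(lam * e for lam, e in zip(lambdas, effects))
    return num / math.sqrt(sum(lam ** 2 / w for lam, w in zip(lambdas, ws)))


def w_for_g(g, n_total):
    """Equations 4.20 and 4.21 combined, assuming equal group sizes."""
    t_squared = g ** 2 * n_total / 4.0
    return n_total * (n_total - 2) / (2.0 * (t_squared + 2 * n_total - 4))


def w_for_d_prime(d, n_total):
    """Equation 4.7."""
    return n_total / (1.0 - d ** 2)


lambdas = [3, 1, -1, -3]
gs = [3.56, 2.13, 0.43, 1.33]          # example 21
ds = [0.89, 0.76, 0.23, 0.59]          # example 22
z_g = contrast(gs, [w_for_g(g, 12) for g in gs], lambdas)
z_d = contrast(ds, [w_for_d_prime(d, 12) for d in ds], lambdas)
print(round(z_g, 2), round(z_d, 2))
```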

TABLE 4.7

Work Table for Computing Contrasts Among Effect Sizes (d')

Study     N      d'     d'^2    1 - d'^2    λ^a    λ^2     λd'      w^b
A        12     .89     .79       .21        3      9     2.67     57.14
B        12     .76     .58       .42        1      1      .76     28.57
C        12     .23     .05       .95       -1      1     -.23     12.63
D        12     .59     .35       .65       -3      9    -1.77     18.46
Σ        48    2.47    1.77      2.23        0     20     1.43    116.80

a. Determined by theory but with $\sum \lambda = 0$.
b. Obtainable from: $w = N/(1 - d'^2)$ (equation 4.7).

TABLE 4.8

Tests for Linear Contrasts in Effect Sizes Defined as r, g, and d'

                               Effect Sizes
Study                       r         g        d'
A                          .89      3.56      .89
B                          .76      2.13      .76
C                          .23       .43      .23
D                          .59      1.33      .59
Median                     .68      1.73      .68
Mean                       .68^a    1.86      .62
Z (linear contrast)       2.01^a    2.29     1.64
p                          .022     .011     .050

a. Based on Fisher’s $z_r$ transformation.

The four effect sizes of this example were chosen to be equivalent in
units of d' to the effect sizes of example 20 (r) and example 21 (g). Table
4.8 summarizes the data for the three effect size estimates of examples 20,
21, and 22. The three Z tests of significance of the linear contrast are
somewhat variable, with the Z for the effect size estimator g being about
14% larger than that for r and the Z for the effect size estimator d' being
about 18% smaller than that for r. However, the range of significance levels
is not dramatic with the most significant result at p = .011 and the least
significant at p = .050.

Before leaving the topic of focused tests, it should be noted that their use
is more efficient than the more common procedure of counting each effect
size or significance level as a single observation (e.g., Eagly & Carli, 1981;
Hall, 1980; Rosenthal & Rubin, 1978; Smith et al., 1980). In that procedure
we might, for example, compute a correlation between the Fisher $z_r$ values
and the $\lambda$’s of example 20 to test the hypothesis of greater effect size
being associated with greater contact time. Although that r is substantial
(.77), it does not even approach significance because of the small number of
df upon which the r is based. The procedures employing focused tests or
contrasts employ much more of the information available and, therefore, are
less likely to lead to Type II errors.
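The loss of power from treating each study as a single observation can be seen by computing that correlation directly. The sketch below is illustrative: `pearson_r` is a hypothetical helper, the Fisher $z_r$ values are the tabled ones from example 20, and the p value uses the closed-form tail of the t distribution that holds only for df = 2.

```python
import math


def pearson_r(xs, ys):
    """Product-moment correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


z_rs = [1.42, 1.00, 0.23, 0.68]        # Fisher z_r values of example 20
lambdas = [3, 1, -1, -3]               # linear contrast weights
r = pearson_r(z_rs, lambdas)
df = len(z_rs) - 2                     # K - 2 = 2 df for the correlation
t = r * math.sqrt(df / (1.0 - r ** 2))
# For df = 2 only: P(T > t) = 1/2 - t / (2 * sqrt(t ** 2 + 2)).
p_one_tailed = 0.5 - t / (2.0 * math.sqrt(t ** 2 + 2.0))
print(round(r, 2), round(t, 2), round(p_one_tailed, 2))
```

With only four studies the correlation rests on 2 df, so its one-tailed p is far from the .022 obtained for the contrast on the same data.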



III.C. Combining Studies

lll.C.L Significance testing. After comparing the results of any set of 
three or more studies it is an easy matter also to combine the p levels of the 
set of studies to get an overall estimate of the probability that the set of p 
levels might have been obtained if the null hypothesis of no relationship 
between X and Y were true. Of the various methods available that will be 
described in the next chapter, we present here only the generalized version 
of the method presented earlier in our discussion of combining the results of 
two groups. 

This method requires only that we obtain Z for each of our p levels, all of 
which should be given as one-tailed. Z’s disagreeing in direction from the 
bulk of the findings are given negative signs. Then, the sum of the Z’s divided
by the square root of the number (K) of studies yields a new statistic
distributed as Z. Recapping,

$$\frac{\sum Z_j}{\sqrt{K}} \ \text{is distributed as } Z \tag{4.30}$$

Should we want to do so, we could weight each of the Z’s by its df, its 
estimated quality, or any other desired weights (Mosteller & Bush, 1954; 
Rosenthal, 1978, 1980). 

The general procedure for weighting Z’s is to multiply each Z by any 
desired weight (assigned before inspection of the data), add the weighted 
Z’s, and divide the sum of the weighted Z’s by the square root of the sum of 
the squared weights as follows:

$$\text{Weighted } Z = \frac{\sum w_j Z_j}{\sqrt{\sum w_j^2}} \tag{4.31}$$


Example 24 will illustrate the application of this procedure. 
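
Equations 4.30 and 4.31 can be sketched in a few lines of Python. This is a minimal illustration, not part of the original text: the function names `stouffer_z` and `one_tailed_p` are hypothetical, and the Z’s and quality weights are those used in examples 23 and 24.

```python
from math import sqrt
from statistics import NormalDist


def stouffer_z(zs, weights=None):
    """Combine signed standard normal deviates.

    With no weights this is equation 4.30 (sum of Z's over sqrt(K));
    with weights it is equation 4.31. Equal weights of 1 make the two identical.
    """
    if weights is None:
        weights = [1.0] * len(zs)
    numerator = sum(w * z for w, z in zip(weights, zs))
    return numerator / sqrt(sum(w * w for w in weights))


def one_tailed_p(z):
    """Upper-tail area of the standard normal distribution beyond z."""
    return 1.0 - NormalDist().cdf(z)


# Z's of Studies A-D; Study C is negative because it disagrees in direction.
zs = [1.04, 1.64, -2.33, 3.09]
# Illustrative quality weights, assigned before inspecting the data.
quality = [2.4, 2.2, 3.1, 3.8]

z_equal = stouffer_z(zs)
z_weighted = stouffer_z(zs, quality)
print(round(z_equal, 2), round(one_tailed_p(z_equal), 3))
print(round(z_weighted, 2), round(one_tailed_p(z_weighted), 3))
```

Treating the unweighted case as "all weights equal to 1" keeps a single code path for both equations.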


Example 23. Studies A, B, C, and D yield one-tailed p values of .15, .05,
.01, and .001, respectively. Study C, however, shows results opposite in
direction from the results of the remaining studies. The four Z’s associated
with these four p’s, then, are 1.04, 1.64, -2.33, and 3.09. From equation
4.30 we have

$$\frac{\sum Z_j}{\sqrt{K}} = \frac{(1.04) + (1.64) + (-2.33) + (3.09)}{\sqrt{4}} = 1.72$$

as our new Z value which has an associated p value of .043 one-tailed or .086
two-tailed. We would normally employ the one-tailed p value if we had
correctly predicted the bulk of the findings but would employ the two-tailed
p value if we had not. The combined p that we obtained in this example
supports the results of the majority of the individual studies. However, even
if these p values (.043 and .086) were more significant, we would want to be
very cautious about drawing any simple overall conclusion because of the
very great heterogeneity of the four p values we were combining. Example
15, which employed the same p values, showed that this heterogeneity was
significant at p = .0013. It should be emphasized again, however, that this
great heterogeneity of p values could be due to heterogeneity of effect sizes,
heterogeneity of sample sizes, or both. To find out about the sources of
heterogeneity, we would have to look carefully at the effect sizes and sample
sizes of each of the studies involved.

Example 24. Studies A, B, C, and D are those of example 23 just above,
but now we have decided to weight each study by the mean rating of internal
validity assigned it by a panel of methodologists. These weights (w) were
2.4, 2.2, 3.1, and 3.8 for Studies A, B, C, and D, respectively. Employing
equation 4.31 we find:

$$\text{Weighted } Z = \frac{(2.4)(1.04) + (2.2)(1.64) + (3.1)(-2.33) + (3.8)(3.09)}{\sqrt{(2.4)^2 + (2.2)^2 + (3.1)^2 + (3.8)^2}} = \frac{10.623}{\sqrt{34.65}} = 1.80$$

as the Z of the weighted combined results of Studies A, B, C, and D. The p
value associated with this Z is .036 one-tailed or .072 two-tailed. In this
example weighting by quality of research did not lead to a very different
result than was obtained when weighting was not employed (example 23); in
both cases p = .04 one-tailed. Actually, it might be more accurate to say for
example 23 that weighting was equal with all w’s = 1 than to say that no
weighting was employed.

III.C.2. Effect size estimation. When we combine the results of three or
more studies we are at least as interested in the combined estimate of the
effect size as we are in the combined probability. We follow here our earlier
procedure of considering r as our primary effect size estimator while
recognizing that many other estimates are possible. For each of the three or more
studies to be combined we compute r and the associated Fisher z_r and have

$$\bar{z}_r = \frac{\sum z_r}{K} \tag{4.32}$$

as the Fisher z_r corresponding to our mean r (where K refers to the number
of studies combined). We use a table of Fisher z_r to find the r associated with
our mean z_r. Should we want to give greater weight to larger studies we
could weight each z_r by its df, i.e., N - 3 (Snedecor & Cochran, 1967,
1980, 1989), by its estimated research quality, or by any other weights
assigned before inspection of the data.

The weighted mean z_r is obtained as follows:

$$\text{Weighted } \bar{z}_r = \frac{\sum w_j z_{r_j}}{\sum w_j} \tag{4.33}$$

Example 26 will illustrate the application of this procedure.

Example 25. Studies A, B, C, and D yield effect sizes of r = .70, .45, .10,
and -.15, respectively. The Fisher z_r values corresponding to these r’s are
.87, .48, .10, and -.15, respectively. Then, from equation 4.32 we have

$$\bar{z}_r = \frac{\sum z_r}{K} = \frac{(.87) + (.48) + (.10) + (-.15)}{4} = .32$$

as our mean Fisher z_r. From our table of Fisher z_r values we find a z_r of .32 to
correspond to an r of .31. Just as in our earlier example of combined p levels,
however, we would want to be very cautious in our interpretation of this
combined effect size. If the r’s we have just averaged were based on substantial
sample sizes, as was the case in example 16, they would be significantly
heterogeneous. Therefore averaging without special thought and comment
would be inappropriate.

Example 26. Studies A, B, C, and D are those of example 25 just above,
but now we have decided to weight each study by a mean rating of ecological
validity assigned to it by several experts. These weights were 1.7, 1.6, 3.1,
and 2.5 for Studies A, B, C, and D, respectively. Employing equation 4.33
we find:

$$\text{Weighted } \bar{z}_r = \frac{\sum w_j z_{r_j}}{\sum w_j} = \frac{(1.7)(.87) + (1.6)(.48) + (3.1)(.10) + (2.5)(-.15)}{1.7 + 1.6 + 3.1 + 2.5} = \frac{2.182}{8.90} = .24$$

as our mean Fisher z_r, which corresponds to an r of .24. In this example
weighting by quality of research led to a somewhat smaller estimate of
combined effect size than did equal weighting (.24 versus .31).

III.C.2.a. Other effect size estimates. Any other effect size, e.g., Cohen’s
d, Hedges’s g, Glass’s Δ, the difference between proportions (d'), and so on
can be combined with or without weighting just as we have shown for r. The
only difference is that when we combine r’s we typically transform them to
Fisher’s z_r’s before combining, while for most other effect size estimates we
combine them directly without prior transformation.
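
The averaging of r’s through Fisher’s z_r (equations 4.32 and 4.33) can be sketched as follows. This is an illustrative sketch only: the function name `mean_effect_size` is hypothetical, and `math.atanh` and `math.tanh` stand in for the printed table of Fisher z_r values, so results may differ from hand calculation in the third decimal.

```python
from math import atanh, tanh


def mean_effect_size(rs, weights=None):
    """Average correlations via Fisher's z_r and transform back to r.

    Unweighted (equation 4.32) when weights is None, otherwise a
    weighted mean z_r (equation 4.33). Returns (mean z_r, corresponding r).
    """
    if weights is None:
        weights = [1.0] * len(rs)
    z_rs = [atanh(r) for r in rs]  # Fisher's r-to-z_r transformation
    mean_z = sum(w * z for w, z in zip(weights, z_rs)) / sum(weights)
    return mean_z, tanh(mean_z)  # tanh inverts the transformation


rs = [0.70, 0.45, 0.10, -0.15]      # Studies A-D of example 25
validity = [1.7, 1.6, 3.1, 2.5]     # ecological validity ratings of example 26

z_equal, r_equal = mean_effect_size(rs)
z_weighted, r_weighted = mean_effect_size(rs, validity)
print(round(r_equal, 2), round(r_weighted, 2))
```

For effect sizes that are combined directly (d, g, and so on), the same weighted mean applies with the two transformation steps omitted.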

EXERCISES 

Six experiments were conducted to investigate the effects of a new treatment 
procedure. The following table shows the effect size (r) obtained in each study and 
the number of patients employed in each study (a positive r means the new treatment 
was better): 


Study    Effect Size (r)     N
  1           .64            43
  2           .33            64
  3           .03            39
  4           .02            46
  5          -.04            24
  6          -.04            20

1. Compute the significance level for each of the above studies and give the Z
associated with each significance level.

2. Give the weighted and the unweighted mean effect size for these six studies. 

3. Give the significance level associated with each of the two mean effect sizes 
of question 2. 

4. Report and interpret the results of a test of the heterogeneity of these six effect 
sizes. 

5. Test the hypothesis that larger studies obtained larger effect sizes in this set of 
studies. Report the Z, p, and r derived from this contrast. 

6. Convert the effect sizes given above to Cohen’s d or Hedges’s g. Then answer 
questions 1 to 5 for this new effect size estimate. 


Combining Probabilities 


Various methods for combining independent probabilities are described and compared. A 
warning is offered against the direct combining of the raw data of different studies.
Finally, the problem of the “file drawer” is discussed, in which studies with null results may
be unpublished and unretrievable by the meta-analyst.

I. GENERAL PROCEDURES


In the preceding chapter, some basic procedures that can be used to compare
and to combine levels of significance and effect size estimates were
presented. In addition to the basic procedures presented, there are various
alternative methods available for combining probability levels that are
especially useful under particular circumstances.

In this section on general procedures we summarize the major methods 
for combining the probabilities obtained from two or more studies testing 
essentially the same directional hypothesis. Although it is possible to do so,
no consideration is given here to questions of combining results from studies
in which the direction of the results cannot be made immediately apparent,
as would be the case for F tests (employed in analysis of variance) with
df > 1 for the numerator or for chi-square tests (of independence in contingency
tables) with df > 1. Although this section is intended to be self-contained,
it is not intended to serve as a summary of all the useful ideas on
the topic at hand that are contained in the literature referenced. The seminal
work of Mosteller and Bush (1954) is especially recommended. For a review 
of the relevant literature see Rosenthal (1978a). 

I.A. The Basic Methods

Table 5.1 presents the results of a set of five illustrative studies. The first
column of information about the studies lists the results of the t test. The
sign preceding t gives the direction of the results; a positive sign means the
difference is consistent with the bulk of the results, a negative sign means
the difference is inconsistent. The second column records the df upon which
each t was based.

TABLE 5.1

Summary of Seven Methods for
Combining Probabilities of Independent Experiments

Study       t       df    One-tail p    Effect Size r      Z       -2 log_e p
1         +1.19     40       .12            .18          +1.17        4.24
2         +2.39     60       .01            .29          +2.33        9.21
3         -0.60     10       .72           -.19          -0.58        0.66
4         +1.52     30       .07            .27          +1.48        5.32
5         +0.98     20       .17            .21          +0.95        3.54
Σ         +5.48    160      1.09           +.76          +5.35       22.97
Mean      +1.10     32       .22           +.15          +1.07        4.59
Median    +1.19     30       .12           +.21          +1.17        4.24

NOTES: The seven methods follow.

1. Method of Adding Logs:

$$\chi^2(df = 2N) = \sum -2 \log_e p = 22.97, \quad p = .011 \text{ one-tail} \tag{5.1}$$

2. Method of Adding Probabilities (applicable when Σp near unity or less):

$$p = \frac{(\sum p)^N}{N!} = \frac{(1.09)^5}{5!} = .013 \text{ one-tail} \tag{5.2}$$

3. Method of Adding t’s:

$$Z = \frac{\sum t}{\sqrt{\sum [df/(df - 2)]}} = \frac{5.48}{\sqrt{40/38 + 60/58 + 10/8 + 30/28 + 20/18}} = \frac{5.48}{\sqrt{5.5197}} = 2.33, \quad p = .01 \text{ one-tail} \tag{5.3}$$

4. Method of Adding Z’s:

$$Z = \frac{\sum Z}{\sqrt{N}} = \frac{5.35}{\sqrt{5}} = 2.39, \quad p = .009 \text{ one-tail} \tag{5.4}$$

5. Method of Adding Weighted Z’s:

$$Z = \frac{T}{\sigma_T} = \frac{df_1 Z_1 + df_2 Z_2 + \cdots + df_n Z_n}{\sqrt{df_1^2 + df_2^2 + \cdots + df_n^2}} = \frac{(40)(+1.17) + (60)(+2.33) + \cdots + (20)(+0.95)}{\sqrt{(40)^2 + (60)^2 + \cdots + (20)^2}} = \frac{244.2}{\sqrt{6600}} = 3.01, \quad p = .0013 \text{ one-tail} \tag{5.5}$$

6. Method of Testing Mean p:

$$Z = (.50 - \bar{p})\sqrt{12N} = (.50 - .22)\sqrt{12(5)} = 2.17, \quad p = .015 \text{ one-tail} \tag{5.6}$$

7. Method of Testing Mean Z:

$$t = \frac{\sum Z / N}{\sqrt{S^2_{(Z)} / N}} = \frac{+1.07}{\sqrt{.22513}} = 2.26, \quad df = 4, \quad p < .05 \text{ one-tail} \tag{5.7}$$

or

$$F = \frac{(\sum Z)^2}{(N) S^2_{(Z)}} = 5.09, \quad df = 1, 4, \quad p < .05 \text{ one-tail}$$

The third column gives the one-tailed p associated with each t. It should 
be noted that one-tail p’s are always less than .50 when the results are in the 
consistent direction, but they are always greater than .50 when the results 
are not consistent. For example, study 3 with a t of -.60 is tabulated with a
one-tail p of .72. If the t had been in the consistent direction, i.e., +.60, the
one-tail p would have been .28. It is important to note that it is the direction 
of difference which is found to occur on the average that is assigned the + 
sign, and hence the lower one-tail p. The basic computations and results are 
identical whether we were very clever and predicted the net direction of 
effect or not clever at all and got it quite wrong. At the very end of our 
calculations, we can double the final overall level of significance if we want 
to make an allowance for not having predicted the net direction of effect. 

The fourth column of the table gives the size of the effect defined in 
terms of the Pearson r. 

The fifth column gives the standard normal deviate, or Z associated with 
each p value. The final column of our table lists the natural logarithms of the
one-tail p’s (of the third column of information) multiplied by -2. Each is a
quantity distributed as χ² with 2 df and is an ingredient of the first method of
combining p levels to be presented in this section (Fisher, 1932, 1938).

I.A.1. Adding logs. The last column of our table is really a list of χ² values.
The sum of independent χ²’s is also distributed as χ² with df equal to
the sum of the df’s of the χ²’s added. Therefore, we need only add the five
χ²’s of our table and look up this new χ² with 5 × 2 = 10 df. The results are
given just below the row of medians of our table; χ² = 22.97, which is
associated with a p of .011, one-tail, when df = 10.
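
The method of adding logs can be sketched as below. This is a minimal illustration under stated assumptions: the function names are hypothetical, and instead of a table of critical χ² values the code uses the closed-form upper-tail area of χ², which holds only for even df (always the case here, since df = 2N).

```python
from math import exp, factorial, log


def chi2_upper_tail_even_df(x, df):
    """Upper-tail probability of chi-square for even df (closed form)."""
    if df % 2 != 0:
        raise ValueError("closed form used here requires even df")
    half = x / 2.0
    return exp(-half) * sum(half ** i / factorial(i) for i in range(df // 2))


def add_logs(one_tail_ps):
    """Fisher's method: sum of -2 ln(p), referred to chi-square with 2N df."""
    chi2 = -2.0 * sum(log(p) for p in one_tail_ps)
    df = 2 * len(one_tail_ps)
    return chi2, df, chi2_upper_tail_even_df(chi2, df)


ps = [0.12, 0.01, 0.72, 0.07, 0.17]  # one-tail p's of Table 5.1
chi2, df, p_combined = add_logs(ps)
print(round(chi2, 2), df, round(p_combined, 3))
```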

The method of adding logs, sometimes called the Fisher method, though
frequently cited, suffers from the disadvantage that it can yield results that
are inconsistent with such simple overall tests as the sign test of the null
hypothesis of a 50:50 split (Siegel, 1956). Thus for a large number of studies,
if the vast majority showed results in one direction, we could easily
reject the null hypothesis by the sign test even if the consistent p values were
not very much below .50. However, under these situations the Fisher
method would not yield an overall significant p (Mosteller & Bush, 1954).
Another problem with the Fisher method is that if two studies with equally
and strongly significant results in opposite directions are obtained, the
Fisher method supports the significance of either outcome! Thus p’s of .001
for A > B and .001 for B > A combine to a p < .01 for A > B or B > A
(Adcock, 1960). Despite these limitations, the Fisher method remains the
best known and most discussed of all the methods of combining independent
probabilities (see Rosenthal, 1978 for a review of the literature). Because
of its limitations, however, routine use does not appear indicated.


I.A.2. Adding probabilities. A powerful method has been described by
Edgington (1972a) in which the combined probability emerges when the 
sum of the observed p levels is raised to the power equivalent to the number 
of studies being combined (N) and divided by N!. Essentially, this formula 
gives the area of a right triangle when the results of two studies are being 
combined, the volume of a pyramid when the results of three studies are 
combined, and the n-dimensional generalization of this volume when more 
studies are involved. Our table shows the results to be equivalent to those 
obtained by the Fisher method for this set of data. The basic Edgington 
method is useful and ingenious but is limited to small sets of studies, since it 
requires that the sum of the p levels not exceed unity by very much. When 
the sum of the p levels does exceed unity, the overall p obtained tends to be 
too conservative unless special corrections are introduced. 
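
The basic form of the method of adding probabilities is a one-line computation, sketched here. The function name `add_probabilities` is hypothetical, and the sketch covers only the uncorrected formula, so it is meaningful only when the sum of the p’s is near unity or less.

```python
from math import factorial


def add_probabilities(one_tail_ps):
    """Basic Edgington method: (sum of p's) ** N / N!.

    No correction terms are included, so the result is only
    appropriate when the sum of the p's does not much exceed 1.
    """
    n = len(one_tail_ps)
    return sum(one_tail_ps) ** n / factorial(n)


ps = [0.12, 0.01, 0.72, 0.07, 0.17]  # one-tail p's of Table 5.1; sum = 1.09
p_combined = add_probabilities(ps)
print(round(p_combined, 3))
```

For two studies the expression reduces to (p1 + p2)² / 2, the area of a right triangle whose two legs each have length p1 + p2.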

I.A.3. Adding t’s. A method that has none of the disadvantages of the preceding
two methods was described by Winer (1971). Based on the result that
the variance of the t distribution for any given df is df/(df - 2), it requires
adding the obtained t values and dividing that sum by the square root of the
sum of the df’s associated with the t’s after each df has been divided by df - 2.
The result of the calculation is itself approximately a standard normal
deviate that is associated with a particular probability level when each of
the t’s is based on df of at least 10 or so. When applied to the data of our table,
the Winer method yields p = .01, one-tail, a result very close to the earlier
two results. The limitation of this method is that it cannot be employed when
the size of the samples for which t is computed becomes less than three,
because that would involve dividing by zero or by a negative value. In
addition, the method may not give such good approximations to the normal
with df < 10 for each t.
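
A short sketch of the method of adding t’s follows. The function name `add_ts` is hypothetical; the check on df reflects the limitation just described, since df/(df - 2) is undefined or negative when df is 2 or less.

```python
from math import sqrt
from statistics import NormalDist


def add_ts(ts, dfs):
    """Sum of signed t's divided by sqrt of the sum of df / (df - 2)."""
    if any(df <= 2 for df in dfs):
        raise ValueError("each t must be based on df > 2")
    z = sum(ts) / sqrt(sum(df / (df - 2) for df in dfs))
    return z, 1.0 - NormalDist().cdf(z)  # Z and its one-tail p


ts = [1.19, 2.39, -0.60, 1.52, 0.98]  # t's of Table 5.1
dfs = [40, 60, 10, 30, 20]
z, p_one_tail = add_ts(ts, dfs)
print(round(z, 2), round(p_one_tail, 3))
```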

I.A.4. Adding Z’s. Perhaps the simplest of all, the Stouffer method described
in the last chapter (Mosteller & Bush, 1954) asks us only to add the
standard normal deviates (or Z’s) associated with the p’s obtained, and divide 
by the square root of the number of studies being combined (Adcock, 1960; 
Cochran, 1954; Stouffer, Suchman, DeVinney, Star, & Williams, 1949, p. 45). 
Each Z was a standard normal deviate under the null hypothesis. The variance 
of the sum of independent normal deviates is the sum of their variances. Here, 
this sum is equal to the number of studies, since each study has unit variance. 
Our table shows results for the Stouffer method that are very close to those 
obtained by the method of adding t’s (Z = 2.39 vs. Z = 2.33). 

I.A.5. Adding weighted Z's. Mosteller and Bush (1954) have suggested a 
technique that permits us to weight each standard normal deviate by the size 
of the sample on which it is based (or by its df), or by any other desirable 
positive weighting such as the elegance, internal validity, or real-life
representativeness (ecological validity) of the individual study. The method,
illustrated in the last chapter, requires us to add the products of our weights
and Z’s, and to divide this sum by the square root of the sum of the squared 
weights. Our table shows the results of the application of the weighted 
Stouffer method with df employed as weights. We note that the result is the 
lowest overall p we have seen. This is because, for the example, the lowest p 
levels are given the heaviest weighting because they are associated with the 
largest sample sizes and df. Lancaster (1961) has noted that when weighting 
is employed, the Z method is preferable to weighting applied to the Fisher 
method for reasons of computational convenience and because the final sum 
obtained is again a normal variable. Finally, for the very special case of just 
two studies, Zelen and Joel (1959) describe the choice of weights to minimize
Type II errors.

I.A.6. Testing the mean p. Edgington (1972b) has proposed a normal
curve method to be used when there are four or more studies to be combined.
The mean of the p’s to be combined is subtracted from .50, and this
quantity is multiplied by the square root of 12N, where N is the number of
studies to be combined. (The presence of a 12 derives from the fact that the
variance of the population of p values is 1/12, when the null hypothesis of no
treatment effects is true.)

I.A.7. Testing the mean Z. In this modification of the Stouffer method 
Mosteller and Bush (1954) first convert p levels to Z values. They then compute
a t test on the mean Z value obtained with the df for t equal to the number
of Z values available minus one. Mosteller and Bush, however, advise against
this procedure when there are fewer than five studies to be combined. That
suggestion grows out of the low power of the t test when based on few
observations. Our table illustrates this low power by showing that this method
yields the largest combined p of any of the methods reviewed.

I.B. Additional Methods 

I.B.1. Counting. When the number of studies to be combined grows
large, a number of counting methods can be employed (Brozek & Tiede,
1952; Jones & Fiske, 1953; Wilkinson, 1951). The number of p values below
.50 can be called +, the number of p values above .50 can be called -, and a
sign test can be performed. If 12 of 15 results are consistent in either direction,
the sign test tells us that results so rare “occur by chance” only 3.6% of
the time. This procedure, and the closely related one that follows, have been
employed by Hall (1979, 1984).
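
The sign test just described can be sketched with an exact binomial calculation. The function name `sign_test_two_tailed` is hypothetical. The exact two-tailed probability for 12 of 15 comes to about .035, close to the 3.6% quoted above; the small difference presumably reflects the table or approximation used for the printed figure.

```python
from math import comb


def sign_test_two_tailed(n_consistent, n_total):
    """Exact two-tailed sign test of a 50:50 split."""
    k = max(n_consistent, n_total - n_consistent)
    # probability of a split at least this extreme in one direction
    tail = sum(comb(n_total, i) for i in range(k, n_total + 1)) / 2 ** n_total
    return min(1.0, 2.0 * tail)  # "in either direction" doubles the tail


p_sign = sign_test_two_tailed(12, 15)
print(round(p_sign, 3))
```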

The χ² statistic may also be useful in comparing the number of studies
expected to reach a given level of significance under the null hypothesis
with the number actually reaching that level (Rosenthal, 1969, 1976;
Rosenthal & Rosnow, 1975; Rosenthal & Rubin, 1978a). In this application
there are two cells in our table of counts, one for the number reaching some
critical level of p, the other for the number not reaching that critical level of
p. When there are 100 or more studies available, we can set our critical p
level at .05. Our expected frequency for that cell is .05N while our expected
frequency for the other cell is .95N. For example, suppose that 12 of 120
studies show results at p ≤ .05 in the same direction. Then our expected
frequencies for the two cells are .05(120) and .95(120) respectively, as
shown in Table 5.2.

It is not necessary to set our critical value of p at .05. We could as well use
.10 or .01. However, it is advisable to keep the expected frequency of our
smaller cell at 5 or above. Therefore, we would not use a critical value of .01
unless we had at least 500 studies altogether. To keep our smaller expected
frequency at 5 or more we would use a critical level of .10 if we had 50 studies,
a critical level of .20 if we had 25 studies, and so on. More generally, when
there are fewer than 100 studies but more than 9, we enter in one cell an
expected frequency of 5 and in the other an expected frequency of N - 5. The
observed frequency for the first cell, then, is the number of studies reaching
p ≤ 5/N. The observed frequency for the second cell is the number of studies
with p > 5/N. The resulting χ² can then be entered into a table of critical χ²
values. Alternatively, the square root of χ² can be computed to yield Z, the
standard normal deviate. Although clear-cut results on the issue are not
available, it appears likely that the counting methods are not as powerful as other
methods described here.

TABLE 5.2

Counting Method for Assessing Overall
Significance of a Relationship (χ² Method)

                              Studies Reaching    Studies Not Reaching
Counts                            p ≤ .05               p ≤ .05             Σ
Obtained                            12                    108              120
Expected
(if null hypothesis true)            6(a)                 114(b)           120

$$\chi^2(1) = \sum \frac{(O - E)^2}{E} = \frac{(12 - 6)^2}{6} + \frac{(108 - 114)^2}{114} = 6.32, \quad p = .012$$

$$Z = \sqrt{\chi^2(1)} = \sqrt{6.32} = 2.51, \quad p = .006 \text{ one-tail}$$

a. Computed from .05(N) = .05(120) = 6.

b. Computed from .95(N) = .95(120) = 114.
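
The χ² counting method can be sketched as follows, using the counts of Table 5.2. The function name `counting_chi2` and its parameter names are hypothetical. One assumption is labeled in the code: taking the positive square root as Z is appropriate only when more studies reach the critical level than expected.

```python
from math import erfc, sqrt


def counting_chi2(n_reaching, n_studies, critical_p=0.05):
    """Two-cell chi-square comparing obtained with expected counts."""
    expected_hit = critical_p * n_studies
    expected_miss = n_studies - expected_hit
    obtained_miss = n_studies - n_reaching
    chi2 = ((n_reaching - expected_hit) ** 2 / expected_hit
            + (obtained_miss - expected_miss) ** 2 / expected_miss)
    # Assumes n_reaching > expected_hit, so the positive root is the right sign.
    z = sqrt(chi2)
    p_chi2 = erfc(z / sqrt(2.0))  # upper tail of chi-square with 1 df
    return chi2, z, p_chi2, p_chi2 / 2.0  # last value is the one-tail p for Z


chi2, z, p_chi2, p_one_tail = counting_chi2(12, 120)
print(round(chi2, 2), round(z, 2), round(p_chi2, 3), round(p_one_tail, 3))
```

With fewer than 100 studies, `critical_p` would be set to 5/N so that the smaller expected frequency stays at 5.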

I.B.2. Blocking. The last method, adapted from the procedure given by 
Snedecor and Cochran (1967; see also Cochran & Cox, 1957) requires that 
we reconstruct the means, sample sizes, and mean square within conditions 
for each of our studies. We then combine the data into an overall analysis of 
variance in which treatment condition is the main effect of primary interest 
and in which studies are regarded as a blocking variable. If required because 
of differences among the studies in their means and variances, the dependent
variables of the studies can be put onto a common scale (e.g., zero
mean and unit variance).

When studies are assumed to be a fixed factor, as they sometimes are
(Cochran & Cox, 1957), or when the MS for treatments × studies is small
relative to the MS within, the treatment effect is tested against the pooled
MS within (Cochran & Cox, 1957). When the studies are regarded as a
random factor and when the MS for treatments × studies is substantial relative
to the MS within (say, F > 2), the treatments × studies effect is the
appropriate error term for the treatment effect. Regardless of whether studies
are viewed as fixed or random factors, the main effect of studies and the
interaction of treatments × studies are tested against the MS within.

Substantial main effects of studies may or may not be of much interest,
but substantial treatments × studies interaction effects will usually be of
considerable interest. It will be instructive to study the residuals defining

TABLE 5.4

Advantages and Limitations of Nine Methods
of Combining Probabilities

1. Adding Logs
   Advantages:  Well-established historically
   Limitations: Cumulates poorly; can support opposite conclusions.
   Use when:    N of studies small (≤ 5)

2. Adding p’s
   Advantages:  Good power
   Limitations: Inapplicable when N of studies (or p’s) large unless complex
                corrections are introduced.
   Use when:    N of studies small (Σp ≤ 1.0)

3. Adding t’s
   Advantages:  Unaffected by N of studies given minimum df per study
   Limitations: Inapplicable when t’s based on very few df.
   Use when:    Studies not based on too few df

4. Adding Z’s
   Advantages:  Routinely applicable; simple
   Limitations: Assumes unit variance when under some conditions Type I or
                Type II errors may be increased.
   Use when:    Anytime

5. Adding Weighted Z’s
   Advantages:  Routinely applicable, permits weighting
   Limitations: Assumes unit variance when under some conditions Type I or
                Type II errors may be increased.
   Use when:    Whenever weighting desired

6. Testing Mean p
   Advantages:  Simple
   Limitations: N of studies should not be less than four.
   Use when:    N of studies ≥ 4

7. Testing Mean Z
   Advantages:  No assumption of unit variance
   Limitations: Low power when N of studies small.
   Use when:    N of studies ≥ 5

8. Counting
   Advantages:  Simple and robust
   Limitations: Large N of studies needed; may be low in power.
   Use when:    N of studies large

9. Blocking
   Advantages:  Displays all means for inspection, thus facilitating search
                for moderators (variables altering the relationship between
                independent and dependent variables).
   Limitations: Laborious when N large; insufficient data may be available
                to employ this procedure.
   Use when:    N of studies not too large

method of adding Z’s combined with one or more of the counting methods
as a check. Practical experience with the various methods suggests that
there is only rarely a serious discrepancy among appropriately chosen
methods. It goes without saying, of course, that any overall p that has been
computed (or its associated test statistic with df) should be reported and not
suppressed for being higher or lower than the investigator might like.

To make possible the computations described in this chapter, authors
should routinely report the exact t, F, Z, or other test statistic along with its
df or N, rather than simply making such vague statements as “t was significant
at p ≤ .05.”

Reporting the test statistic along with an approximate p level also seems 
preferable to reporting the “exact” p level for three reasons: (1) the exact p 
level may be difficult to determine without a computer or a calculator that 
stores such distributions as Z, t, F, and χ², (2) ambiguity about one-tail versus
two-tail usage is avoided, and (3) the test statistic allows us to compute exact
p as well as the effect size. Speaking of effect size, we encourage editors to
routinely require the report of an effect size (e.g., r, g, Δ, or d) for every test
statistic reported. 

Finally, it should be noted that even if we have established a low combined
p, we have said absolutely nothing about the typical size of the effect
the “existence” of which we have been examining. We owe it to our readers
to give for each combined p estimate an estimate of the probable size of the
effect in terms of a correlation coefficient, a σ unit, or some other estimate.
This estimated effect size should be accompanied, when possible, by a
confidence interval.

I.D. On Not Combining Raw Data

Sometimes it happens that the raw data of two or more studies are available.
We have seen how these data could be appropriately combined in the
method of blocking. There may, however, be a temptation to combine the 
raw data without first blocking or subdividing the data on the basis of the 
studies producing the data. The purpose of this section is to help avoid that 
temptation by showing the very misleading or even paradoxical results that 
can occur when raw data are pooled without blocking. 
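
The kind of result at issue can be sketched in code using the two-subject data of the four studies in Table 5.5. The names `pearson_r`, `studies`, and `pooled_r` are hypothetical; the correlation is computed from its definition rather than from any statistics package.

```python
from math import sqrt


def pearson_r(pairs):
    """Pearson correlation for a list of (x, y) pairs."""
    n = len(pairs)
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    syy = sum((y - mean_y) ** 2 for _, y in pairs)
    return sxy / sqrt(sxx * syy)


# (X, Y) scores of the two subjects in each of the four studies
studies = {
    1: [(2, 0), (0, 2)],
    2: [(4, 2), (2, 4)],
    3: [(6, 4), (4, 6)],
    4: [(8, 6), (6, 8)],
}


def pooled_r(study_ids):
    """Correlation after pooling raw data without blocking on study."""
    return pearson_r([pair for i in study_ids for pair in studies[i]])


within = [pearson_r(data) for data in studies.values()]  # each is -1.0
print(within)
print(round(pooled_r([1, 2]), 2), round(pooled_r([1, 4]), 2),
      round(pooled_r([1, 2, 3, 4]), 2))
```

Every within-study r is -1.00, yet no pooled combination yields a negative r, because the between-study differences in means dominate the pooled variation.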

Table 5.5 shows the results of four studies in which the correlation between
variables X and Y is shown for two subjects. The number of subjects
per study makes no difference and the small number (n = 2) is employed here
only to keep the example simple. For each of the four studies the correlation
(r) between X and Y is -1.00. However, no matter how we combine the raw
data of these four studies, the correlation is never negative again. Indeed, the
range of r’s is from zero (as when we pool the data between any two adjacent
studies) to .80 (as when we pool the data from studies 1 and 4). The remainder
of Table 5.5 shows the six different correlations that are possible (.00, .45,
.60, .67, .72, .80) as a function of which studies are pooled.

TABLE 5.5

Effects of Pooling Raw Data: Four Studies

               Study 1      Study 2      Study 3      Study 4
               X     Y      X     Y      X     Y      X     Y
Subject 1      2     0      4     2      6     4      8     6
Subject 2      0     2      2     4      4     6      6     8
Mean          1.0   1.0    3.0   3.0    5.0   5.0    7.0   7.0
r              -1.00        -1.00        -1.00        -1.00

Correlations obtained when pooling:

                      Two Studies             Three or Four Studies
r =                .00     .60     .80      .45       .67         .72
Pooled studies:    1+2     1+3     1+4      1,2,3     1,2,3,4     1,2,4
                   2+3     2+4              2,3,4                 1,3,4
                   3+4

How can these anomalous results be explained? Examination of the
means of the X and Y variables for the four studies of Table 5.5 helps us
understand. The means of the X and Y variables differ substantially from
study to study and are substantially positively correlated. Thus, in study 1
the X and Y scores (although perfectly negatively correlated) are all quite
low relative to the X and Y scores of study 4, which are all quite high
(although also perfectly negatively correlated). Thus, across these studies in
which the variation is substantial, we have an overall positive correlation
between variables X and Y. Within these studies, where the correlations are
negative (-1.00), the variation in scores is relatively small, small enough to
be swamped by the variation between studies.

Although there may be times when it is useful to array the data from
multiple studies in order to see an overall pattern of results, or to see what
might happen if we planned a single study with variation equivalent to that
shown by a set of pooled studies, Table 5.5 serves as serious warning of how
pooled raw data can lead to conclusions (though not necessarily “wrong”)
opposite to those obtained from individual, less variable studies.

I.D.1. Yule’s or Simpson’s Paradox. Nearly a century ago G. Udny Yule
(1903) described a related problem in dealing with 2 × 2 tables of counts. He
showed how two studies in which no relationship (r = .00) was found between
the variables defined by the two rows and the two columns, could yield a
positive correlation (r = .19) when the raw data were pooled. Similarly,
Simpson (1951) showed how two studies with modest positive correlations
(r’s = .03 and .04) could yield a zero correlation when the raw data were
pooled. Table 5.6 illustrates the problem described by Yule (1903), by
Simpson (1951) and by others (e.g., Birch, 1963; Blyth, 1972; Fienberg,
1977; Glass et al., 1981; and Upton, 1978).

TABLE 5.6

Effects of Pooling Tables of Counts

                    Study 1           Study 2           Pooled
                 Alive    Dead     Alive    Dead     Alive    Dead
Example I
  Treatment       100     1000      100      10       200     1010
  Control          10      100     1000     100      1010      200
  Σ               110     1100     1100     110      1210     1210
  r                  0                 0               -.67

Example II
  Treatment        50      100       50       0       100      100
  Control           0       50      100      50       100      100
  Σ                50      150      150      50       200      200
  r                .33               .33                 0


In Example I of Table 5.6 we see two studies showing zero correlation 
between the treatment condition and the outcome. When the raw data of these 
two studies are pooled, however, we find a dramatic correlation of -.67, suggesting
that the treatment was harmful. Note that in Study 1 only 9% of patients
survived, while 91% received the treatment, whereas in Study 2, 91% of 
patients survived but only 9% received the treatment. It is these inequalities 
of row and column totals that lead to Yule’s (or Simpson’s) Paradoxes. 

Example II of Table 5.6 shows two studies each obtaining a strong effect 
favoring the treatment condition (r = .33). When these two studies were 
pooled, however, these strong effects vanished. Note that in Study 1 only 
25% of patients survived while 75% received the treatment, whereas in 
Study 2, 75% of patients survived but only 25% received the treatment. Had 
row and column totals been equal, the paradoxes of pooling would not have 
occurred. 
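
The arithmetic behind Table 5.6 can be checked with a few lines of code. The sketch below assumes that r for each 2 x 2 table is phi computed from the four cell counts, with a positive sign meaning that treatment goes with survival; the function names `phi` and `pool` are illustrative and do not come from the text.

```python
from math import sqrt

def phi(a, b, c, d):
    """Phi (product-moment r) for the 2 x 2 table [[a, b], [c, d]]."""
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    return (a * d - b * c) / denom

def pool(*tables):
    """Pool raw counts by adding the tables cell by cell."""
    return tuple(sum(cells) for cells in zip(*tables))

# Cell order: treatment-alive, treatment-dead, control-alive, control-dead.
example_1 = [(100, 1000, 10, 100), (100, 10, 1000, 100)]
example_2 = [(50, 100, 0, 50), (50, 0, 100, 50)]

for name, studies in (("Example I", example_1), ("Example II", example_2)):
    separate = [round(phi(*study), 2) for study in studies]
    pooled = round(phi(*pool(*studies)), 2)
    print(name, "separate r's:", separate, "pooled r:", pooled)
```

Computing phi study by study and only then combining keeps the between-study differences in row and column totals from masquerading as a treatment effect.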

The moral of the pooling paradoxes is clear. Except for the exploratory 
purposes mentioned earlier, raw data should not be pooled without blocking.
In most cases, effect sizes and significance levels should be computed 
separately for each study and only then combined. 

II. SPECIAL ISSUES 


Earlier in this chapter we saw that the method of adding Z’s was perhaps 
the most generally serviceable method for combining probabilities. In the 
following section we provide procedures facilitating the use of this method. 

II. A. Obtaining the Value of Z 

The method of adding Z’s requires that we begin by converting the obtained
one-tailed p level of each study to its equivalent Z. The value of Z is 
zero when the one-tailed p is .50, positive as p decreases from p = .50 to p 
close to zero, and negative as p increases from .50 to p close to unity. Thus a 
one-tailed p of .01 has an associated Z of 2.33, while a one-tailed p of .99 
has an associated Z of -2.33. These values can be located in the table of 
probabilities associated with observed values of Z in the normal distribution
found in most textbooks on statistics. 

Unfortunately for the meta-analyst, few studies report the Z associated 
with their obtained p. Worse still, the obtained p’s are often given imprecisely
as < .05 or < .01, so that p might be .001 or .0001 or .00001. If p is all 
that is given in a study, all we can do is use a table of the normal distribution 
to find the Z associated with a reported p. Thus, one-tailed p’s of .05, .01, 
and .001 are found to have associated Z’s of 1.65, 2.33, and 3.09, respectively.
(If a result is simply called “nonsignificant,” and if no further information
is available, we have little choice but to treat the result as a p of .50, 
Z = 0.00.) 
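
Where a printed normal table is not at hand, the same conversion can be sketched in code. The example below assumes Python 3.8 or later, whose standard library provides `statistics.NormalDist`; the helper name `z_from_one_tailed_p` is ours, not the text's.

```python
from statistics import NormalDist

def z_from_one_tailed_p(p):
    """Standard normal deviate whose upper-tail area equals p (0 < p < 1)."""
    # inv_cdf(p) is the point with lower-tail area p; flipping the sign
    # gives the point with upper-tail area p.
    return -NormalDist().inv_cdf(p)

for p in (0.05, 0.01, 0.001, 0.99):
    print(f"one-tailed p = {p}: Z = {z_from_one_tailed_p(p):.3f}")
```

The value for p = .05 comes out as 1.645, which the usual tables round to 1.65.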

Since p’s reported in research papers tend to be imprecisely reported, we 
can do a better job of combining p’s by going back to the original test statistics 
employed, e.g., t, F, or χ². Fortunately, many journals require that these statistics,
along with their df, be reported routinely. The df for t and for the denominator
of the F test in analysis of variance tell us about the size of the study. The 
df for χ² is analogous to the df for the numerator of the F test in analysis of 
variance and so tells us about the number of conditions, not the number of 
sampling units. Fortunately, the 1983 edition of the Publication Manual of the 
American Psychological Association has added a requirement that when reporting
χ² test statistics the total N be given along with the df. 


II.A.1. Test statistics. If a t test was employed, we can use a t table to find 
the Z associated with the obtained t. Suppose t(20) = 2.09 so that p = .025, 
one-tailed. We enter the t table at the row for df = 20 and read across to the t 
value of 2.09. Then we read down the column to the entry for df = ∞, which 
is the Z identical to the value of t with ∞ df (1.96). Suppose, however, that 
our t was 12.00 with 144 df. Even extended tables of t cannot help us when 
values of t are that large for substantial df (Federighi, 1959; Rosenthal & 
Rosnow, 1984a, 1991). A very accurate estimation of Z from t is available 
for such circumstances (Wallace, 1959): 

$$Z = \sqrt{df \, \log_e\!\left(1 + \frac{t^2}{df}\right)} \; \sqrt{1 - \frac{1}{2\,df}} \tag{5.14}$$

A very useful and conservative approximation to this formula is also available
(Rosenthal & Rubin, 1979a): 

$$Z = t \left(1 - \frac{t^2}{4\,df}\right) \tag{5.15}$$

This approximation works best when t² < df; when t² = df, this approximation
tends to be 10% smaller than the Z obtained from equation 5.14. 
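
Both conversions are easy to sketch in code. The block below assumes the two formulas take the forms Z = sqrt(df ln(1 + t²/df)) sqrt(1 - 1/(2 df)) for the Wallace estimate and Z = t(1 - t²/(4 df)) for the conservative approximation; these forms reproduce the two checks given in the text (t(20) = 2.09 gives 1.96, and the shortfall is about 10% when t² = df), but the function names are illustrative only.

```python
from math import log, sqrt

def z_from_t_wallace(t, df):
    """Assumed form of equation 5.14: accurate Z from t, sign following t."""
    z = sqrt(df * log(1.0 + t * t / df)) * sqrt(1.0 - 1.0 / (2.0 * df))
    return z if t >= 0 else -z

def z_from_t_conservative(t, df):
    """Assumed form of equation 5.15; intended for t squared below df."""
    return t * (1.0 - t * t / (4.0 * df))

# The two cases mentioned in the text.
print(round(z_from_t_wallace(2.09, 20), 2))
accurate = z_from_t_wallace(12.0, 144)
rough = z_from_t_conservative(12.0, 144)
print(round(accurate, 2), round(rough, 2), round(rough / accurate, 2))
```

For t = 12.00 with 144 df, t² equals df, so the ratio printed last shows the roughly 10% shortfall of the conservative approximation.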

If the test statistic employed was F (from analysis of variance) and df for 
the numerator was unity, we take √F as t and proceed as we did in the 
case of t with df equal to the df of the denominator of the F ratio. We should 
note that F ratios of df > 1 in the numerator cannot be used in combining p 
levels to address a directional hypothesis. 

If the test statistic employed was χ² (for independence in contingency 
tables) with df = 1, we take √χ² directly, since χ²(1) = Z². We should note 
that χ²’s of df > 1 cannot be used in combining p levels to address a directional
hypothesis. 

When √F or √χ² is employed we must be sure that Z is given the appropriate
sign to indicate the direction of the effect. 


II.A.2. Effect size estimates. Sometimes we want to find Z for a study in 
which no test statistic is given (e.g., t, F, χ²), but an effect size estimator such 
as r (including point biserial r and phi), g, Δ, or d is given along with a rough 
p level indicator such as p < .05. In those cases we can often get a serviceable
direct approximation of Z by using the fact that (phi)² = χ²(1)/N so that 
N(phi)² = χ²(1) and √N(phi) = √χ²(1) = Z. 

In the case of r or point biserial r, multiplying by √N will yield a generally
conservative approximation to Z. A more accurate value can be obtained
by solving for t in the equation: 

$$t = \frac{r}{\sqrt{1 - r^2}} \times \sqrt{df} \quad \text{or} \quad t = \frac{r}{\sqrt{1 - r^2}} \times \sqrt{N - 2}$$

and then employing t to estimate Z as shown in equations 5.14 or 5.15. 
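
As a hedged illustration of this two-step route, the sketch below takes a hypothetical study (r = .50, N = 20, values chosen by us) and compares the quick r√N approximation with the Z obtained by way of t. The conversion from t to Z uses the assumed Wallace-type form Z = sqrt(df ln(1 + t²/df)) sqrt(1 - 1/(2 df)); all function names are illustrative.

```python
from math import log, sqrt

def t_from_r(r, n):
    """t = r / sqrt(1 - r^2) * sqrt(N - 2), with df = N - 2."""
    return r / sqrt(1.0 - r * r) * sqrt(n - 2)

def z_from_t(t, df):
    """Assumed Wallace-type estimate of Z from t (sign follows t)."""
    z = sqrt(df * log(1.0 + t * t / df)) * sqrt(1.0 - 1.0 / (2.0 * df))
    return z if t >= 0 else -z

r, n = 0.50, 20                      # hypothetical effect size and sample size
t = t_from_r(r, n)
quick_z = r * sqrt(n)                # direct approximation
better_z = z_from_t(t, n - 2)        # route through t
print(f"t({n - 2}) = {t:.2f}, r*sqrt(N) = {quick_z:.3f}, Z via t = {better_z:.3f}")
```

In this particular example the quick value falls slightly below the Z obtained through t; with other combinations of r and N the gap can differ, so the route through t is the safer choice when N is known.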

We will not review here how to get t from the other effect size estimators 
but that information is found in equations 2.3 - 2.13 (Tables 2.1 and 2.2), 
4.19, and 4.20. 

II. B. The File Drawer Problem 

Statisticians and behavioral researchers have long suspected that the 
studies published in the behavioral and social sciences are a biased sample 
of the studies that are actually carried out (Bakan, 1967; McNemar, 1960; 
Smart, 1964; Sterling, 1959). The extreme view of this problem, the file 
drawer problem, is that the journals are filled with the 5% of the studies that 
show type I errors, while the file drawers back at the lab are filled with the 
95% of the studies that show nonsignificant (e.g., p > .05) results (Rosenthal,
1979a; Rosenthal & Rubin, 1988). 

In the past, there was very little we could do to assess the net effect of 
studies tucked away in file drawers that did not make the magic .05 level 
(Rosenthal & Gaito, 1963, 1964; Nelson, Rosenthal, & Rosnow, 1986). Now, 
however, although no definitive solution to the problem is available, we can 
establish reasonable boundaries on the problem and estimate the degree of 
damage to any research conclusion that could be done by the file drawer 
problem. The fundamental idea in coping with the file drawer problem is 
simply to calculate the number of studies averaging null results (Z = 0.00) 
that must be in the file drawers before the overall probability of a type I error 
can be just brought to any desired level of significance, say p = .05. This 
number of filed studies, or the tolerance for future null results, is then 
evaluated for whether such a tolerance level is small enough to threaten the 
overall conclusion drawn by the reviewer. If the overall level of significance 
of the research review will be brought down to the level of just significant 
by the addition of just a few more null results, the finding is not resistant to 
the file drawer threat. (For a more technical discussion of the underpinnings 
of the following computations, see Rosenthal & Rubin, 1988.) 

II.B.1. Computation. To find the number (X) of new, filed, or unretrieved 
studies averaging null results required to bring the new overall p to any 
desired level, say, just significant at p = .05 (Z = 1.645), one simply writes: 

$$1.645 = \frac{K\bar{Z}}{\sqrt{K + X}} \tag{5.16}$$

where K is the number of studies combined and Z̄ is the mean Z obtained for 
the K studies. 

Rearrangement shows that 

$$X = \frac{K\left[K\bar{Z}^2 - 2.706\right]}{2.706} \tag{5.17}$$

An alternative formula that may be more convenient when the sum of the 
Z’s (ΣZ) is given rather than the mean Z, is as follows: 

$$X = \frac{(\Sigma Z)^2}{2.706} - K \tag{5.18}$$


One method based on counting rather than adding Z’s may be easier to 
compute and can be employed when exact p levels are not available, but it is 
probably less powerful. If X is the number of new studies required to bring 
the overall p to .50 (not to .05), s is the number of summarized studies 
significant at p < .05, and n is the number of summarized studies not significant
at .05, then 

$$X = 19s - n \tag{5.19}$$

where 19 is the ratio of the total number of nonsignificant (at p > .05) 
results to the number of significant (at p < .05) results expected when the 
null hypothesis is true. 

Another conservative alternative when exact p levels are not available is 
to set Z = .00 for any nonsignificant result and to set Z = 1.645 for any 
result significant at p < .05. 
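
The two fallbacks for missing exact p levels can be sketched as follows. The block assumes the counting rule X = 19s - n and, for the coarse-Z alternative, the sum-of-Z's form X = (ΣZ)²/2.706 - K; the counts used (10 significant and 20 nonsignificant studies) are hypothetical, and the function names are ours.

```python
def failsafe_by_counting(s, n):
    """Counting rule: new null studies needed to bring overall p to .50."""
    return 19 * s - n

def failsafe_with_coarse_z(s, n, crit_sq=2.706):
    """Score Z = 1.645 for each significant study and Z = 0 otherwise,
    then find the studies needed to bring overall p to .05."""
    k = s + n
    sum_z = 1.645 * s          # nonsignificant studies contribute nothing
    return sum_z ** 2 / crit_sq - k

s, n = 10, 20                  # hypothetical review
print(failsafe_by_counting(s, n))
print(round(failsafe_with_coarse_z(s, n)))
```

The two numbers are not directly comparable, since the counting rule targets p = .50 while the coarse-Z rule targets p = .05.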

The equations above all assume that each of the K studies is independent 
of all other K - 1 studies, at least in the sense of employing different sampling
units. There are other senses of independence, however; for example, 
we can think of two or more studies conducted in a given laboratory as less 
independent than two or more studies conducted in different laboratories. 
Such nonindependence can be assessed by such procedures as intraclass 
correlations. Whether nonindependence of this type serves to increase type 
I or type II errors appears to depend in part on the relative magnitude of the 
Z’s obtained from the studies that are “correlated” or “too similar.” If the 
correlated Z’s are on the average as high as or higher than the grand mean Z 
corrected for nonindependence, the combined Z we compute treating all 
studies as independent will be too large, leading to an increase in type I 
errors. If the correlated Z’s are on the average clearly low relative to the 
grand mean Z corrected for nonindependence, the combined Z we compute 
treating all studies as independent will tend to be too small, leading to an 
increase in type II errors. 

II.B.2. Illustration. In 1969, 94 experiments examining the effects of interpersonal
self-fulfilling prophecies were summarized (Rosenthal, 1969). 
The mean Z of these studies was 1.014, K was 94, and Z for the studies 
combined was: 

$$\frac{\Sigma Z}{\sqrt{K}} = \frac{K\bar{Z}}{\sqrt{K}} = \frac{94(1.014)}{\sqrt{94}} = 9.83$$


How many new, filed, or unretrieved studies (X) would be required to 
bring this very large Z down to a barely significant level (Z = 1.645)? From 
equation 5.17 of the preceding section: 

$$X = \frac{K\left[K\bar{Z}^2 - 2.706\right]}{2.706} = \frac{94\left[94(1.014)^2 - 2.706\right]}{2.706} = 3{,}263$$


One finds that 3,263 studies averaging null results (Z = .00) must be 
crammed into file drawers before one would conclude that the overall 
results were due to sampling bias in the studies summarized by the reviewer.
In a more recent summary of the same area of research (Rosenthal 
& Rubin, 1978) the mean Z of 345 studies was 1.22, K was 345, and X was 
65,123. In a still more recent summary of the same area of research, the mean 
Z was 1.30, K was 443, and X was 122,778. Thus over 120,000 unreported 
studies averaging a null result would have to exist somewhere before the 
overall results could reasonably be ascribed to sampling bias. 
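
The first two of these tolerance figures can be reproduced with a short sketch. It assumes the rearranged formula X = K[K Z̄² - 2.706]/2.706 (the equation 5.17 cited above), with 2.706 taken as the rounded square of 1.645; the function names are illustrative.

```python
from math import sqrt

def combined_z(mean_z, k):
    """Z for K studies combined: sum of Z's divided by sqrt(K)."""
    return k * mean_z / sqrt(k)

def failsafe_x(mean_z, k, crit_sq=2.706):
    """Number of filed studies averaging Z = 0 needed to bring the
    combined result down to just significant at p = .05, one-tailed."""
    return k * (k * mean_z ** 2 - crit_sq) / crit_sq

print(round(combined_z(1.014, 94), 2))   # 1969 summary, K = 94
print(round(failsafe_x(1.014, 94)))
print(round(failsafe_x(1.22, 345)))      # 1978 summary, K = 345
```

Because the published mean Z's are rounded, results for other summaries may differ somewhat from the printed X when recomputed this way.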




II.B.3. Guidelines for a tolerance level. At the present time, no firm 
guidelines can be given as to what constitutes an unlikely number of unretrieved
and/or unpublished studies. For some areas of research 100 or even 
500 unpublished and unretrieved studies may be a plausible state of affairs 
while for others even 10 or 20 seems unlikely. Probably any rough and ready 
guide should be based partly on K so that as more studies are known it 
becomes more plausible that other studies in that area may be in those file 
drawers. Perhaps we could regard as robust to the file drawer problem any 
combined results for which the tolerance level (X) reaches 5K + 10. That 
seems a conservative but reasonable tolerance level; the 5K portion suggests
that it is unlikely that the file drawers have more than five times as 
many studies as the reviewer, and the +10 sets the minimum number of 
studies that could be filed away at 15 (when K = 1). 
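
The rule of thumb is simple enough to state as a check. In the sketch below the function names are ours, and "reaches" is read as "is at least", which is an assumption about the intended boundary case.

```python
def tolerance_threshold(k):
    """Suggested benchmark: five times the studies in hand, plus ten."""
    return 5 * k + 10

def is_robust_to_file_drawer(x, k):
    """True when the tolerance for null results X reaches 5K + 10."""
    return x >= tolerance_threshold(k)

# The 1969 summary discussed earlier: K = 94, X = 3,263.
print(tolerance_threshold(94), is_robust_to_file_drawer(3263, 94))
```

With K = 94 the benchmark is 480 filed studies, far below the 3,263 that the combined result could tolerate.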

It appears that more and more reviewers of research literatures will be 
estimating average effect sizes and combined p’s of the studies they summarize.
It would be very helpful to readers if for each combined p they presented,
reviewers also gave the tolerance for future null results associated 
with their overall significance level. 


II.B.4. Empirical estimates of the magnitude of the file drawer problem. 
In chapter 3 section I.A.2 we examined the differences in effect sizes obtained 
from such information sources as journal articles, books, theses, and unpublished
materials. There we saw that there was no clear difference in typical 
effect size obtained in studies that were published in journals versus studies 
that were not yet published. In this section our emphasis will be on significance
testing rather than on effect size estimation, and we will try to get some 
reasonable estimates of the magnitude of the file drawer problem. 

To begin with, there seems to be little doubt that the statistical significance 
of a result is positively associated with its being published. In a series of six 
studies, for example, the range of these correlations was from .20 to .42 with 
a median r of .33. This result is roughly equivalent to two thirds of significant 
results being published while only one third of nonsignificant results are 
published in a population of studies in which about half are published and 
half are not and in which about half are significant and half are not. (The six 
studies on which this analysis is based are Atkinson, Furlong, & Wampold, 
1982; Blackmore, 1980; Chan, Sacks, & Chalmers, 1982; Coursol & Wagner, 
1986; Simes, 1987; Sommer, 1987.) 
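
One way to see where "two thirds versus one third" comes from is to turn r into a pair of rates in a 2 x 2 table whose margins are all split evenly. We assume here that this is the reasoning behind the equivalence (it corresponds to the binomial effect size display); the function names are illustrative.

```python
from math import sqrt

def rates_from_r(r):
    """With all margins at 50%, r corresponds to rates of .50 +/- r/2."""
    return 0.5 + r / 2.0, 0.5 - r / 2.0

def phi(a, b, c, d):
    """Phi for the 2 x 2 table [[a, b], [c, d]]."""
    return (a * d - b * c) / sqrt((a + b) * (c + d) * (a + c) * (b + d))

high, low = rates_from_r(0.33)
print(round(high, 3), round(low, 3))
# Check: 100 significant and 100 nonsignificant results, published or not.
print(round(phi(100 * high, 100 * low, 100 * low, 100 * high), 2))
```

The rates of about .665 and .335 are the "two thirds" and "one third" of the text, and the table built from them returns the median r of .33.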


II.B.4.a. Estimating retrieval bias. Several studies have been conducted 
to try to estimate the number of studies that might be languishing in the 
file drawers. Shadish, Doherty, and Montgomery (1989) took a simple random
sample of 519 possible investigators from a population of 14,002 
marital/family therapy professionals. Responses concerning their research 
were obtained from 375 (72%) of the sample, and this yielded only 3 studies 
(J= .34) that could have been included in a meta-analysis. Shadish, Doherty, 
and Montgomery concluded tentatively that the file drawers might contain 
about 112 (14,002 x 3/375) studies, which is not quite as many as they had 
retrieved for their ongoing meta-analysis (about 165). 

In a survey of all members of a population of researchers, Sommer (1987) 
wrote to all 140 members of the Society for Menstrual Cycle Research. Based 
on a response rate of 65%, Sommer found little publication bias. Of the 73 
published studies, 30% were significant in the predicted direction; of the 42 
studies in the publication pipeline, 38% were significant in the predicted 
direction; and of the 28 studies securely filed away, 29% were significant. 
When only those studies were considered for which significance testing data 
were available, the corresponding percentages were 61 %, 76%, and 40%. An 
interesting sidelight of Sommer’s study was that far and away the best 
predictor of publication status of the article was the productivity of the author. 

Employing a different approach, Rosenthal and Rubin (1988) compared 
the meta-analytic results for a research domain for the case of complete 
retrieval with the results for that same research domain for the more typical 
case of incomplete retrieval. 



II.B.4.b. Complete versus incomplete retrieval. For an earlier meta-analysis
of 103 studies of interpersonal expectancy effects, studies could be 
divided into one group in which all could be retrieved because they were all 
conducted in a single laboratory (Rosenthal, 1969) and a second group of 
retrieved studies conducted elsewhere. Table 5.7 shows the mean Z obtained 
for each of these two sets of studies subdivided by whether the study (a) had 
been published at the time of the original (1969) meta-analysis, (b) had been 
unpublished at the time of the meta-analysis but was published by the time 
of the present analysis (1990), or (c) had been unpublished at the time of the 
meta-analysis and remained unpublished in 1990. 

TABLE 5.7 

Mean Z’s in Two Conditions of Retrievability 

                                Publication Status
Retrieval       Published        Published Later     Never Published

Complete        1.08   20ᵃ
Incomplete      2.60   10
Mean            1.58   30

a. Number of studies on which Z is based. 


Analysis of variance of the 103 studies’ Z’s cast into the 2 x 3 table showed 
the interaction (F(2, 97) = 0.49, p = .614) to be sufficiently small that the 
following contrasts tell the story. The comparison of retrievability yielded 
t(97) = 2.92, p = .0022, r = .17; the comparison of studies ever published 
with those never published yielded t(97) = 0.24, r = .05; the comparison of 
studies published at the time of the meta-analysis and those published later 
yielded t(97) = 2.52, p = .0077, r = .16; and the comparison of studies 
published versus not published at the time of the meta-analysis yielded t(97) 
= 2.32, p = .011, r = .15. 

Thus, the completely retrieved data meta-analysis yielded less significant 
results on average than did the incompletely retrieved data meta-analysis. 
That result fits our suspicions that completely retrieved data show less 
significant results than do less completely retrieved data. However, in this 
table, retrievability is fully confounded with production by a particular 
laboratory, which might differ in various ways from the remaining laboratories
producing results bearing on the same research question. 

The publication status effects are more surprising. As expected, published 
results were more significant than initially unpublished results. However, of 
the initially unpublished studies, those published eventually were less significant
than were those never published. The impact of publication status, 
therefore, depends on whether we group the later published with the unpublished
(as would be done at the time of the original meta-analysis) or with 
the published studies (as would be done if we gave unpublished studies more 
years to become published). 

II.B.4.c. Immediate versus delayed meta-analysis. Table 5.8 shows that 
immediate meta-analysis led to substantial publication status bias with t(97) 
= 2.29, p = .012, r = .15. However, a meta-analysis delayed to allow for 
eventual publication yielded essentially no publication status bias with t(97) 
= 0.24, p = .405, r = .05 (see Table 5.9). 

Information relevant to the design of a meta-analytic study is provided by 
the fact that the publication delays for the originally unpublished studies 
ranged up to 13 years with a mode of 1 year and a median of about 3 years; 
after 5 years, 33 of the 35 originally unpublished studies (94%) had been 
published. 


TABLE 5.8 

Mean Z’s: Immediate Meta-Analysis 

Retrieval       Published       Unpublished

Complete
Incomplete
Mean


TABLE 5.9 

Mean Z’s: Delayed Meta-Analysis 

Retrieval       Published       Unpublished       Difference

Complete                        0.60   18
Incomplete                      1.39   20
Mean                            1.02   38


EXERCISES 

1. For each of the six studies summarized in the exercises of Chapter 4 compute t, 
df, the one-tail p, the Z associated with each of these p’s, and the quantity -2 log_e p. 

2. Combine the probabilities of the six studies using the methods of (a) adding 
logs, (b) adding p’s, (c) adding t’s, (d) adding Z’s, (e) adding weighted Z’s, (f) testing 
mean p, and (g) testing mean Z. 

3. Assume, for the moment, that the seven combined p’s computed for question 2 
are independent. Perform a test of the heterogeneity of the seven obtained p levels 
and interpret the resulting statistic and its associated p level. 

4. For the 50 studies you were able to retrieve for a meta-analysis, the mean 
standard normal deviate (Z) was .75 (associated with a one-tailed p of about .23). 
How many unretrieved studies averaging null results (Z = 0.00) must there be in 
the file drawers before the overall result would be brought to the brink of nonsignificance
at p = .05? 

5. Imagine that we had the raw data available for all six studies of the table given 
in the exercises of Chapter 4. Explain in prose and demonstrate by numerical example
how it might happen that if we pooled the raw data of all six studies we might 
obtain results opposite in direction to those we found in the exercises of Chapters 4 
and 5. 

ILLUSTRATIONS OF META-ANALYTIC PROCEDURES 

Based on the results of actual meta-analyses, illustrations are provided of a variety of 
meta-analytic procedures. Examples are drawn from research on nonverbal communication 
skills, the validity of the PONS test, the detection of deception, the effects of interpersonal 
expectancies, the effects of psychotherapy, sex differences in cognitive performance, and 
hit rates in ganzfeld studies for which a one-sample effect size index, π, is especially 
appropriate. 


In earlier chapters when various meta-analytic procedures were described 
they were often illustrated with hypothetical examples in order to keep the 
computational examples small and manageable. In the present chapter, we 
will illustrate various meta-analytic procedures with real-life meta-analytic 
examples. In principle it would be possible to illustrate almost every 
meta-analytic procedure described in this book for every meta-analysis we will be 
examining. For purposes of exposition, however, we will employ each 
meta-analytic example to illustrate a small number of principles and procedures. 

I. DISPLAYING AND COMBINING EFFECT SIZES 

DePaulo and Rosenthal (1979) conducted a meta-analysis of studies of 
the relationship between skill at decoding nonverbal cues and skill at encoding
nonverbal cues. Table 6.1 is an updated summary of the results of 19 
studies. The stem-and-leaf display and its statistical summary were encountered
earlier (Tables 3.8 and 3.9) as useful ways of displaying the results of a 
meta-analysis. It is more informative for the eye to follow the stem-and-leaf 
display than simply to be told that the median r was .16 or that the mean r 
was .13. 

TABLE 6.1 

Stem-and-Leaf Plot and Statistical Summary of Correlations 
Between Encoding and Decoding Skill 

[Stem-and-leaf plot of the 19 correlations (r’s)]

Summary Statistics (Based on r, not on z_r) 

Maximum                         .65
Quartile 3 (Q3)                 .29
Median (Q2)                     .16
Quartile 1 (Q1)                 .00
Minimum                        -.80
Q3 - Q1                         .29
σ̂ [.75(Q3 - Q1)]                .22
S                               .33
Mean                            .13
N                                19
Proportion positive signᵃ       .88

a. Of those having signs. 

The first five entries of the summary statistics require no explanation. 
The quantity Q3 - Q1 gives the range for the middle 50% of the effect 
sizes. The quantity .75(Q3 - Q1) estimates σ quite accurately when the 
distribution is normal and, therefore, is similar to S when the distribution is 
normal. In the data of Table 6.1, S is substantially larger than .75(Q3 - Q1), 
suggesting that these effect sizes may not be normally distributed. If a more 
formal test of normality is required, the Kolmogorov-Smirnov test can be 
employed (Lilliefors, 1967; Rosenthal, 1968). 
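
The comparison of S with .75(Q3 - Q1) can be sketched numerically. The block below uses the summary values reported for Table 6.1 and, as a side check that is our addition rather than the text's, derives the exact multiplier for a normal distribution from `statistics.NormalDist` (Python 3.8 or later).

```python
from statistics import NormalDist

q1, q3, s = 0.00, 0.29, 0.33       # summary values reported for Table 6.1

iqr = q3 - q1
sigma_hat = 0.75 * iqr             # rough estimate of sigma from the quartiles

# For a normal distribution the quartiles sit about 0.674 sigma from the
# mean, so the exact multiplier is 1 / (2 * 0.674), close to the .75 used.
exact_factor = 1.0 / (2.0 * NormalDist().inv_cdf(0.75))

print(f".75(Q3 - Q1) = {sigma_hat:.4f}, S = {s:.2f}, ratio = {s / sigma_hat:.2f}")
print(f"exact normal multiplier = {exact_factor:.3f}")
```

A ratio of S to the quartile-based estimate well above 1 points to heavier tails than a normal distribution would produce, which fits the extreme minimum of -.80.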

A stem-and-leaf plot and its statistical summary can, of course, be made 
for any type of effect size estimator. For example, Rosenthal and Rosnow 
(1975, p. 23) compared the rates of volunteering for behavioral research in 
general of females and males. Their stem-and-leaf display was of 51 such 
differences in volunteering rates, i.e., the effect size estimator d'. In that 
analysis the median d' was .11 with females volunteering more than males 
on the average. This direction of difference was found in 84% of the studies 
of volunteering for behavioral research in general. 

II. COMBINING EFFECT SIZES 
AND SIGNIFICANCE LEVELS 

As part of the construct validation of the PONS test, an instrument designed
to measure sensitivity to nonverbal cues, Rosenthal, Hall, DiMatteo, 
Rogers, and Archer (1979) conducted or located 22 studies in which the 
PONS test total score was correlated with judges’ ratings of subjects’ interpersonal
or nonverbal sensitivity. Table 6.2 shows the results of the meta-analysis.
The display includes all the elements of Table 6.1 relevant to combining
effect sizes. In addition, however, three methods of combining 
probabilities were also employed and are summarized in the lower portion 
of the summary statistics. 

TABLE 6.2 

Stem-and-Leaf Plot and Statistical Summary of 
Validity Coefficients (r) for the PONS Test 

Correlations (r’s) 

Stem     Leaf
 .5      5
 .4      5 6 9
 .3      1 2 3 4
 .2      0 2 2 6 9
 .1      0 1 5
 .0      0* 1 6
-.0      4
-.1
-.2      9
-.3      5

Summary Statistics (Based on r, not z_r) 

Maximum                           .55
Quartile 3 (Q3)
Median (Q2)                       .22
Quartile 1 (Q1)
Minimum                          -.35
Q3 - Q1
σ̂ [.75(Q3 - Q1)]
S
Mean
N                                  22
Proportion positive sign          .86
Z of proportion positive         3.41ᵃ
Combined Stouffer Z              3.73ᵇ
t test of mean Zᶜ
Correlation between r and Z       .86ᵈ

a. p = .0003 

b. p = .0001 

c. p = .0005 

d. p = .0000002 from equation 2.3. 
*This r has a positive sign. 

The first method of combining probabilities listed is one of the counting 
methods. Under the null hypothesis we expect 50% of the correlation coefficients
to have a positive sign. However, the present results show 86% of the 
studies to have a positive sign. Given that 19 of the 22 r’s were positive when 
only 11 were expected to be positive under the null hypothesis, we can employ 
the binomial expansion to calculate how often we expect a result that extreme, 
or more extreme, if the null hypothesis were true. Tables also are available 
to help us find the desired p (e.g., Siegel, 1956; Siegel & Castellan, 1988). 

For most practical applications, however, we can employ a normal approximation
to the binomial distribution that will work quite well even for 
modest sized samples: 

$$Z = \frac{2P - N}{\sqrt{N}}$$

where P is the number of positive effect sizes obtained and N is the number 
of positive plus negative effect sizes obtained. Note that unsigned effect 
sizes are excluded from this analysis. 

For the data of Table 6.2, P = 19 and N = 22. Therefore: 

$$Z = \frac{2(19) - 22}{\sqrt{22}} = 3.41, \quad p = .0003 \text{ one-tailed}$$


In this case, if we had employed the more laborious binomial expansion the 
p value would have been .0004 instead. In using the Z test we assign Z a 
positive value if the direction of the effect is in the predicted direction and a 
negative value if the direction is not the predicted one. 
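
Both routes to this p value can be sketched with the standard library. The block assumes Python 3.8 or later (for `math.comb` and `statistics.NormalDist`); the function names are illustrative.

```python
from math import comb, sqrt
from statistics import NormalDist

def sign_test_z(positives, n):
    """Normal approximation to the binomial: Z = (2P - N) / sqrt(N)."""
    return (2 * positives - n) / sqrt(n)

def exact_upper_tail(positives, n):
    """Exact binomial probability of this many or more positive signs
    when positive and negative signs are equally likely."""
    return sum(comb(n, k) for k in range(positives, n + 1)) / 2 ** n

z = sign_test_z(19, 22)
p_normal = 1.0 - NormalDist().cdf(z)
p_exact = exact_upper_tail(19, 22)
print(f"Z = {z:.2f}, normal p = {p_normal:.4f}, exact p = {p_exact:.4f}")
```

The exact tail counts the outcomes with 19, 20, 21, or 22 positive signs out of the 2^22 equally likely sign patterns.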

Table 6.2 also gives the Z obtained employing Stouffer’s method. In this 
meta-analysis, the sum of the 22 Z’s was 17.48. Therefore from equation 5.4 
we have 

$$\frac{\Sigma Z}{\sqrt{N}} = \frac{17.48}{\sqrt{22}} = 3.73$$

as our obtained Z, with p = .0001, one-tailed. 

The third method of combining significance levels employed in Table 
6.2 was the method of testing the mean Z. In this meta-analysis, the mean 
square for the 22 Z’s was .93. Therefore, from equation 5.7 we have 

$$t = \frac{\Sigma Z / N}{\sqrt{S^2(Z)/N}} = \frac{17.48/22}{\sqrt{.93/22}} = 3.86$$

which, with 21 df, is significant at p = .0005 one-tailed. 
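
Working only from the summary figures quoted above (sum of Z's = 17.48, N = 22, mean square of the Z's = .93), the two statistics can be recomputed as in the sketch below. The function names are ours; the p value for t is not computed because the standard library has no t distribution.

```python
from math import sqrt
from statistics import NormalDist

def stouffer_z(sum_z, n):
    """Stouffer's combined Z: sum of the Z's over the square root of N."""
    return sum_z / sqrt(n)

def t_of_mean_z(sum_z, n, var_z):
    """t test of the mean Z against zero, with N - 1 df."""
    return (sum_z / n) / sqrt(var_z / n)

z = stouffer_z(17.48, 22)
t = t_of_mean_z(17.48, 22, 0.93)
p_z = 1.0 - NormalDist().cdf(z)
print(f"Stouffer Z = {z:.2f} (one-tailed p = {p_z:.4f}), t(21) = {t:.2f}")
```

With the individual Z's in hand, their sum and variance would be computed first and passed to the same two functions.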

Finally, Table 6.2 reports the correlation between the 22 effect sizes (r’s) 
and their degree of statistical significance (Z’s). The r of .86 is very large and 
can be explained on the grounds of this set of validity studies being carried out 
on fairly similar sample sizes. From equation 2.2 we know that the relationship
between Z and r depends more on the square root of N than on N itself. 
For the 22 studies of this meta-analysis, the √N ranges from 2.4 to 9.3, with a 
median √N of 4.4. Because there is a certain homogeneity of sample sizes 
employed in various areas of research, we have often found very substantial 
correlations between effect sizes and levels of statistical significance (Rosenthal
& Rubin, 1978a). This is a useful result, since it sometimes happens that 
we have access only to significance levels but would like to be able to make 
some guesses about effect sizes for a given area of research. On the basis of 
this result it is likely that, typically, within a given research area, more significant
results will also be larger in magnitude. 






III. COMBINING EFFECT SIZES AND BLOCKING 

In their meta-analytic work on the accuracy of the detection of deception, 
Zuckerman, DePaulo, & Rosenthal (1981) were able to retrieve 72 results 
estimating the degree of accuracy. The magnitude of the accuracy was 
defined by r and the median of the 72 r’s was .32. In these 72 results, subjects 
were provided with various sources of information or cues to deception 
including cues from the face, the body, ordinary recorded speech, content-filtered
speech (tone of voice), and written transcription of what was said. 

Table 6.3 shows the median r obtained (and the corresponding z_r) for 
nine sources or channels of information or combinations of channels. The 
overall median r of .32 convinced us that deception was detectable, but told 
us little about which channels might provide the best sources of information 
to permit detection of deception. It was to learn about the relative contribution
to detection of deception of various channels that we blocked or subdivided
the 72 results into the nine subtypes of Table 6.3. 

A clearer picture of the relative contribution to the detection of deception
of the three major channels of face, body, and speech can be obtained by 
rearranging the first seven sources of Table 6.3 into the 2 x 2 x 2 array of 
z_r’s shown in Table 6.4. One can get a quick picture of the relative contribution
of these three channels and their combinations by performing an analysis
of variance on just the eight means shown. Note that the entry for no face, 
no body, no speech is .00, a theoretical value assuming that there can be no 
accuracy when there is no information. 

The lower half of Table 6.4 shows the analysis of variance of the eight z_r’s 
of the upper half. Note that no tests of significance have been computed. Our 
purpose here is merely to get an overview of the relative magnitude of the 
sources of variance shown. The analysis shows very clearly that speech is far 
and away the most important source of cues for the detection of deception. 
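
An analysis of variance on eight cell means with one entry per cell reduces to seven single-df contrasts, which can be sketched as follows. The cell values are the z_r's of samples 1 through 7 of Table 6.3 plus the theoretical .00; coding each channel as 1 (present) or 0 (absent) and the function name `mean_square` are our own illustrative choices.

```python
# z_r keyed by (face, body, speech); 1 = channel present, 0 = absent.
cells = {
    (1, 1, 1): 0.35, (0, 1, 1): 0.62, (1, 1, 0): 0.07, (0, 1, 0): 0.10,
    (1, 0, 1): 0.48, (0, 0, 1): 0.38, (1, 0, 0): -0.08, (0, 0, 0): 0.00,
}

def mean_square(factors):
    """MS (df = 1) for the main effect or interaction of the given factors,
    where factors are indices into the (face, body, speech) key."""
    contrast = 0.0
    for levels, value in cells.items():
        sign = 1
        for f in factors:
            sign *= 1 if levels[f] else -1
        contrast += sign * value
    return contrast ** 2 / len(cells)

effects = {
    "Face": (0,), "Body": (1,), "Speech": (2,),
    "Face x body": (0, 1), "Face x speech": (0, 2),
    "Body x speech": (1, 2), "Face x body x speech": (0, 1, 2),
}
ms = {name: mean_square(f) for name, f in effects.items()}
total = sum(ms.values())
for name, value in ms.items():
    print(f"{name:22s} MS = {value:.4f}   proportion = {value / total:.2f}")
```

Because every source has a single df, each mean square equals its sum of squares, so the proportions are simply each MS divided by their total.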

TABLE 6.3 

Median Accuracy of Detecting Deception (r) 
in Nine Samples of Studies 

Sample   Source of Information        N of Studies   Median r     z_r

1        Face and body and speech          21           .33       .35
2        Face and body                      6           .07       .07
3        Face and speech                    9           .45       .48
4        Face                               7          -.08      -.08
5        Body and speech                    3           .55       .62
6        Body                               4           .10       .10
7        Speech                            12           .36       .38
8        Tone of voice                      4           .06       .06
9        Transcript                         6           .40       .42

         Median                             6           .33       .35

TABLE 6.4

Accuracy of Detecting Deception (z_r) for Eight Sources of Information
Arranged as a 2 x 2 x 2 Factorial Analysis of Variance

                     Speech                 No Speech
              Face       No Face        Face       No Face
Body          .35        .62            .07        .10
No body       .48        .38            -.08       .00 (a)

Sources of Variance       MS (b)    Proportion of Total Variance
Face                      .0098     .02
Body                      .0162     .04
Speech                    .3784     .86
Face x body               .0128     .03
Face x speech             .0005     .00
Body x speech             .0025     .00 (c)
Face x body x speech      .0220     .05

a. Theoretical value.

b. All df = 1; since no significance testing was employed, no estimation of any MS for error was
undertaken.

c. More precisely, .0057.


TABLE 6.5

Accuracy of Detecting Deception (z_r) for Four Sources of Information
Arranged as a 2 x 2 Factorial Analysis of Variance

              Content    No Content
Tone          .38        .06
No tone       .42        .00 (a)

Sources of Variance    MS (b)    Proportion of Total Variance
Content                .1369     .98
Tone                   .0001     .00
Content x tone         .0025     .02

a. Theoretical value.

b. All df = 1.

Once we have seen that speech is the major source of information rele¬ 
vant to deception detection, we may want to get some idea of what aspect of 
speech (e.g., tone versus content) may be most important in providing rele¬ 
vant cues. Fortunately, we can address this question by employing samples 
7, 8, and 9 of Table 6.3. Table 6.5 arrays these data analogously to Table 
6.4. Again the results are clear. Of the two components of speech that could 
be examined for their relative contribution to the detection of deception, it is 
content rather than tone that provides the bulk of the useful information. 

To recap this section, we went from an overall estimate of accuracy based
on 72 results to a subdivision of results that could shed light on questions of
theoretical interest to us. By arranging the subdivisions of our meta-analysis,
we were able to show the relative dominance of speech over visual
cues and further, to show that within speech, content dominates tone as a 
source of cues to deception. Had we wanted more formal significance test¬ 
ing, we could have employed the methods of focused comparisons of Chap¬ 
ter 4. Later in this chapter, we provide additional illustrations of the use of 
such more formal comparisons. 

IV. COMBINING EFFECT SIZES, BLOCKING, 

AND CONFIDENCE INTERVALS 

In their meta-analysis of 345 studies of interpersonal expectancy effects, 
Rosenthal and Rubin (1978) subdivided the studies into eight areas in which 
such studies had been conducted. They employed a stratified random sam¬ 
pling procedure to estimate effect sizes (Cohen’s d) and confidence intervals 
for each of the eight areas of research. Table 6.6 shows that the entire 95% 
confidence interval for the area of animal learning lies above the confidence 
intervals for the areas of reaction time and laboratory interviews. The effects 
of experimenters’ expectancies on the performance of their animal subjects 
appear dependably greater than the effects of experimenters’ expectancies on
the reaction time and interview responses of their human subjects. 

Table 6.6 also shows that the widest confidence interval is around the
mean effect size of studies carried out in everyday contexts such as schools,
clinics, and workplaces. Computing confidence intervals for our overall
meta-analytic results and also our subdivided or blocked meta-analyses
gives us a good indication of the likely value of effect sizes we might expect
to find in the relevant population and subpopulations. Computational details
for when the meta-analysis involves stratified random sampling are given in
Rosenthal and Rubin (1978). The last column of Table 6.6 shows that, for all
eight research areas, there is a substantial correlation between effect size (d)
and level of significance (Z).

TABLE 6.6

Effects of Interpersonal Expectancy
Obtained in Eight Areas of Research

a. Analyses were based on a stratified random sample of 15 studies.

b. Analyses were based on a stratified random sample of 20 studies.

c. Confidence intervals were based on the number of studies available (not on the number of
subjects available).
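
A confidence interval that is based on the number of studies, rather than the number of subjects, treats each study's d as a single observation. The sketch below shows one simple way to do this, assuming an ordinary t-based interval around an unweighted mean; the stratified sampling computations of Rosenthal and Rubin (1978) are not reproduced. The d values are hypothetical, and `T_CRIT_95` and `mean_ci_95` are names made up for the example.

```python
import math
import statistics

# Two-tailed .05 critical values of t for the two df needed here
# (standard table values for samples of 15 and 20 studies).
T_CRIT_95 = {14: 2.145, 19: 2.093}


def mean_ci_95(effect_sizes):
    """95% CI for the mean effect size, treating each study as one observation."""
    k = len(effect_sizes)
    mean = statistics.mean(effect_sizes)
    se = statistics.stdev(effect_sizes) / math.sqrt(k)
    half_width = T_CRIT_95[k - 1] * se
    return mean, mean - half_width, mean + half_width


# Hypothetical d's for one research area sampled at 15 studies (not from Table 6.6).
d_values = [0.10, 0.45, 1.20, 0.30, 0.85, -0.05, 0.60, 0.95,
            0.25, 1.50, 0.40, 0.70, 0.15, 1.05, 0.55]
mean, lower, upper = mean_ci_95(d_values)
print(f"mean d = {mean:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

With this approach the interval narrows as more studies are retrieved, regardless of how many subjects each study employed.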

V. COMPARING EFFECT SIZES: 

EARLY COMPARISONS, FOCUSED AND DIFFUSE 

A much earlier meta-analysis of studies of interpersonal expectancy effects
was conducted on 10 samples of experimenters who had stated before
their research began the mean data they expected to obtain (Rosenthal,
1961, 1963). For each sample, then, the correlation was computed between
the data the experimenters expected to obtain and the data the experimenters
actually did obtain. It should be noted that each of these samples
was homogeneous with regard to expectations that had been experimentally
induced. Therefore, the obtained correlations do not assess the effects of
experimentally induced expectancies but only individual differences in expectancies
after experimenters had been given an expectancy. The range of
expectancies held by experimenters was very much restricted because of the
expectations that had been induced by the investigator.

Table 6.7 shows the correlations between expected and obtained data for 
the 10 samples of experimenters. The purpose of this analysis was to compare
two subsets of the 10 samples. The first five samples (1-5) of Table 6.7
were obtained under ordinary conditions of data collection; the last five 
samples (6-10) were obtained under conditions of high reactance (Brehm, 
1966). The latter samples of experimenters had been offered special incen¬ 
tives to obtain the data they had been led to expect or had been more explic¬ 
itly instructed to bias the results of their research. The question was whether 
these “hyper-motivated” experimenters might show a higher or a lower 
correlation of their expected with their obtained data than the more ordinar¬ 
ily motivated (i.e., control) samples of experimenters. 

The last column of Table 6.7 shows the contrast weights (λ’s) required to
test the question of whether the first five r’s differ from the last five r’s. From
equation 4.27 we have that

\[
Z = \frac{\sum \lambda_j z_{r_j}}{\sqrt{\sum \dfrac{\lambda_j^2}{N_j - 3}}}
  = \frac{(1)(2.65) + (1)(.68) + \cdots + (-1)(-.32)}
         {\sqrt{\dfrac{(1)^2}{3} + \dfrac{(1)^2}{3} + \cdots + \dfrac{(-1)^2}{3}}}
  = \frac{5.39}{\sqrt{3.11}} = 3.06, \quad p = .0022 \text{ two-tail.}
\]






For these 10 samples, therefore, experimenters exposed to greater reactance 
obtained significantly lower correlations between expected and obtained data. 
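
As a rough sketch of how such a contrast can be computed, the block below assumes the usual form in which each Fisher z_r has variance 1/(N - 3). It first checks the arithmetic from the two sums reported above (5.39 and 3.11), then applies the same function to a small hypothetical data set, since the individual entries of Table 6.7 are not all available here. The function names are illustrative only.

```python
import math


def contrast_z(z_rs, ns, lambdas):
    """Z for a contrast among independent Fisher z_r's, each with variance 1/(N - 3)."""
    if abs(sum(lambdas)) > 1e-12:
        raise ValueError("contrast weights must sum to zero")
    numerator = sum(l * z for l, z in zip(lambdas, z_rs))
    variance = sum(l ** 2 / (n - 3) for l, n in zip(lambdas, ns))
    return numerator / math.sqrt(variance)


def two_tailed_p(z):
    """Two-tailed p for a standard normal deviate."""
    return math.erfc(abs(z) / math.sqrt(2))


# Check against the sums reported in the text: 5.39 / sqrt(3.11).
z_reported = 5.39 / math.sqrt(3.11)
print(round(z_reported, 2), round(two_tailed_p(z_reported), 4))

# Hypothetical four-study example (not the Table 6.7 data).
z = contrast_z([0.80, 0.60, 0.10, -0.20], [12, 15, 12, 20], [1, 1, -1, -1])
print(round(z, 2), round(two_tailed_p(z), 4))
```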

When a specific contrast is to be tested, it is not necessary first to test for 
the heterogeneity of effect sizes, just as it is not necessary to compute an
overall F test in the analysis of variance when a contrast has been planned
(Rosenthal & Rosnow, 1984a, 1985, 1991). However, if we had wanted a test
for the heterogeneity of these 10 effect sizes, here is how we would have 
done it. First, employing equation 4.16 we would have obtained the weighted
mean z_r,

\[
\bar{z}_r = \frac{\sum (N_j - 3)\, z_{r_j}}{\sum (N_j - 3)}
          = \frac{3(2.65) + 3(.68) + \cdots + 3(-.32)}{3 + 3 + \cdots + 3} = .36,
\]

a quantity required for use in equation 4.15, the χ² test for the heterogeneity
of effect size estimates:

\[
\sum (N_j - 3)(z_{r_j} - \bar{z}_r)^2
  = 3(2.65 - .36)^2 + 3(.68 - .36)^2 + \cdots + 3(-.32 - .36)^2 = 20.46,
\]

which is distributed as χ²(K - 1) = χ²(9), p < .02.


Thus, the 10 effect sizes differ significantly among themselves. Note that
a disproportionate share of this 9 df χ² of 20.46 is associated with the contrast
of the first five r’s versus the last five r’s. That contrast Z was 3.06 so
its corresponding χ²(1) was Z² = (3.06)² = 9.36, which represents 46% of
the total χ²(9) of 20.46. The difference between the χ²(9) and the χ²(1) is
20.46 - 9.36 = 11.10, the value of the resulting 8 df χ², which is not significant
(p = .196).
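
The partition of the heterogeneity χ² can be checked numerically. The sketch below is an illustration only: `chi2_sf` is a small hand-written upper-tail routine (built on the standard recurrence for the incomplete gamma function, so that no statistics library is needed), and `heterogeneity_chi2` assumes weights of N - 3 for each z_r as in the computation above. Both names are invented for this example.

```python
import math


def chi2_sf(x, df):
    """Upper-tail probability of chi-square, via the incomplete-gamma recurrence."""
    y = x / 2.0
    if df % 2:
        a, q = 0.5, math.erfc(math.sqrt(y))   # starting point: df = 1
    else:
        a, q = 1.0, math.exp(-y)              # starting point: df = 2
    while a < df / 2.0 - 1e-9:
        q += y ** a * math.exp(-y) / math.gamma(a + 1.0)
        a += 1.0
    return q


def heterogeneity_chi2(z_rs, ns):
    """Chi-square (K - 1 df) for heterogeneity of Fisher z_r's weighted by N - 3."""
    weights = [n - 3 for n in ns]
    mean_z = sum(w * z for w, z in zip(weights, z_rs)) / sum(weights)
    return sum(w * (z - mean_z) ** 2 for w, z in zip(weights, z_rs))


chi2_total, df_total = 20.46, 9          # reported heterogeneity test
chi2_contrast = 3.06 ** 2                # contrast Z squared, 1 df
chi2_residual = chi2_total - chi2_contrast
print(f"total:    p = {chi2_sf(chi2_total, df_total):.3f}")
print(f"contrast: share of total = {chi2_contrast / chi2_total:.0%}")
print(f"residual: chi2(8) = {chi2_residual:.2f}, p = {chi2_sf(chi2_residual, 8):.3f}")
```

The residual χ² describes whatever heterogeneity remains among the 10 effect sizes once the planned contrast has been removed.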

V.A. Some Useful (Probably Low Power) Alternatives

The procedure we employed for comparing the first 5 to the last 5 effect
sizes of Table 6.7 used all of the information in our data. That is, it was able 
to make use of the actual size of the sample employed in each of the studies 
being summarized. In this section, we note briefly some procedures that 
treat each study as a single observation so that the same result would be 
obtained whether each study employed a sample size of 10, or 100, or 1,000. 

As a first example, suppose we simply computed a t test comparing the 
mean z_r’s of the first and last five studies of Table 6.7. That t would be 2.44,
which, with 8 df, would be significant at p < .05 two-tailed. If each of the z_r’s
of Table 6.7 had been based on an N of 100, this t would be unaffected. 
However, equation 4.27, which we applied to these data, would continue to 
yield more and more significant results as our sample sizes per study in¬ 
creased. 

As a second example we can apply the Mann-Whitney U test (Siegel,
1956; Siegel & Castellan, 1988). This test asks whether the bulk of one
population is greater than the bulk of a second population. (For a general
discussion of this test see Siegel, 1956, or Siegel & Castellan, 1988.) For the
very special case we have for the data of Table 6.7, i.e., the case of completely
nonoverlapping distributions, and equal n per sample (5 and 5 for these data),
we can estimate Z from

\[ Z = \sqrt{\frac{3n^2}{2n + 1}} \]

This estimate works well even for samples as small as the present ones. For
our data n_1 = n_2 = n = 5, so we find Z as

\[ Z = \sqrt{\frac{3(5)^2}{2(5) + 1}} = 2.61, \quad p = .009 \]

two-tailed. Employing Siegel’s more precise tables yields a two-tailed p of
.008, a result that agrees very well with that of our approximation.
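
For readers who wish to check these figures, the sketch below assumes the approximation is the usual normal approximation to U evaluated at U = 0 with equal n and no continuity correction, which reduces to the square root of 3n²/(2n + 1). It also computes the exact two-tailed probability of complete separation by counting arrangements, which is one way of arriving at the tabled value. The function names are illustrative.

```python
import math


def nonoverlap_z(n):
    """Normal approximation to Mann-Whitney U when U = 0 and both groups have size n."""
    return math.sqrt(3 * n ** 2 / (2 * n + 1))


def exact_two_tailed_p(n):
    """Exact two-tailed p for complete separation of two groups of size n."""
    return 2 / math.comb(2 * n, n)


n = 5
z = nonoverlap_z(n)
p_approx = math.erfc(z / math.sqrt(2))   # two-tailed normal p
print(f"Z = {z:.2f}, approximate p = {p_approx:.3f}, exact p = {exact_two_tailed_p(n):.3f}")
```

For n = 5 there are 252 equally likely ways of assigning five of the ten ranks to the first group, and only two of them give complete separation, so the exact two-tailed p is 2/252.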

The use of these two methods is not recommended as a substitute for 
equation 4.27. However, they are useful for a quick preliminary view of the 
difference between two samples of studies. In addition to the disadvantage 
that these methods cannot profit from increasing n’s per study, they also do 
not have the flexibility of equation 4.27 in permitting any kind of compari¬ 
son one might wish to make. 

VI. COMPARING EFFECT SIZES:

MORE RECENT COMPARISONS 

For our most comprehensive meta-analytic illustration, we consider a
“tertiary” analysis. The analysis will be of a re-analysis by Prioleau, Murdock,
and Brody (1983) of the seminal meta-analysis by Smith et al. (1980).
In general terms, the re-analysis by Prioleau et al., as well as an earlier
re-analysis by Landman and Dawes (1982), supports the conclusions drawn
by Smith et al. (1980).

Prioleau et al. (1983) examined a subset of studies comparing the effects
of psychotherapy to the effects of placebo treatments. In what follows
we examine this subset of studies within the framework of meta-analytic
procedures described in this book and as presented elsewhere (Rosenthal,
1983b).

Table 6.8 summarizes the result of the present meta-analysis. The 32 
studies were divided (or blocked) into five groups. The first three groups 
were entirely comprised of students divided on the basis of age level into 
elementary, secondary, and college level. The last two groups were entirely 
comprised of patients divided on the basis of the type of placebo em¬ 
ployed—psychological versus medical. The psychological placebo patients 
(as well as all the student groups) were those who received some form of 
placebo that could have been viewed by patients as psychological in some 
sense. The medical placebo patients were those who received only a pill 
placebo, i.e., they received only a “medical” placebo treatment. 

The first two rows of Table 6.8 show the number of studies summarized 
and the total number of persons whose data entered into a determination of 
the average size of the effect (Hedges’s g). The third row gives the mean g for 
each group, the fourth and fifth rows give the standard normal deviate (Z) and 
the p level associated with each mean g. The college students and the patients 
who, like the college students, were given psychological placebos both 
showed substantial benefits of psychotherapy relative to placebo controls and 
these differences were significant at p well less than .001. The grand mean 
effect size of .24 (p < .000004, one-tailed) was smaller than that obtained by 
Prioleau et al. and by Smith et al. because it was computed with weighting 
inversely as the variance of g as shown in equations 4.18 and 4.3. 

Rows 6, 7, and 8 of Table 6.8 give the results of tests of heterogeneity of 
effect sizes, i.e., tests of whether the g’s in each set of studies differ signifi¬ 
cantly among themselves. Studies of elementary school children and of pa¬ 
tients receiving psychological placebo yielded g’s that were significantly 
heterogeneous. (See equations 4.17 and 4.18.) 

Lines 9, 10, and 11 address the question of the relationship between the
size of the study and the size of the g. The Z’s and p’s for linear contrasts
show that among elementary school children larger g’s were found in
smaller studies (p = .000004; r = -.24), but among patients receiving
psychological placebos, larger g’s were found in larger studies (p = .009; r =
.16). Thus, although for all studies combined, larger g’s are associated with
smaller studies, there are statistically significant reversals of this overall
relationship.

TABLE 6.9

Contrasts Among Five Groups of Studies
of Psychotherapy Effects

Contrast                                  Z       p (one-tailed)
Students versus patients                  .20     .42
Linear trend in age of students           2.27    .012
Quadratic trend in age of students        2.08    .019
Psychological versus medical placebo      2.18    .015

Table 6.9 shows that the mean g’s of the five groups examined can be 
compared meaningfully within the framework of a set of four contrasts em¬ 
ploying equations 4.28 and either 4.3 or 4.21. The first contrast shows that 
there is little difference between students and patients in the degree to 
which psychotherapy is more effective than placebo. The second contrast 
shows that with increasing age of students, greater g’s are obtained. The 
third contrast shows that the average of the elementary and college student 
groups yields a larger g than does the group of secondary students. (In inter¬ 
preting these contrasts in age we should note that age is likely to be con¬ 
founded here with such variables as IQ, type of treatment, type of placebo 
control, and so forth.) The fourth contrast shows that psychotherapy is more 
effective relative to psychological than to medical placebo controls. Perhaps 
pill placebos are so effective that it is difficult for psychotherapy to be supe¬ 
rior to them. 
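
The one-tailed p levels of Table 6.9 follow directly from the Z's. As a small check, the sketch below converts each Z to its upper-tail normal probability using only the standard library; `one_tailed_p` is a name chosen for this example.

```python
import math


def one_tailed_p(z):
    """Upper-tail probability of a standard normal deviate."""
    return 0.5 * math.erfc(z / math.sqrt(2))


contrasts = {
    "Students versus patients": 0.20,
    "Linear trend in age of students": 2.27,
    "Quadratic trend in age of students": 2.08,
    "Psychological versus medical placebo": 2.18,
}
for label, z in contrasts.items():
    print(f"{label:38s} Z = {z:.2f}  p = {one_tailed_p(z):.3f}")
```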

To address this last question, to help understand the significant linear and
quadratic contrasts in age, the meaning of the sometimes positive, sometimes
negative correlation between g and N, and the significant heterogeneity
of g’s found among studies of elementary school children and studies
of patients given psychological placebo, additional studies will be required.

This section was designed to illustrate how the systematic application of 
various meta-analytic procedures can lead to firmer inferences about a do¬ 
main of research. At the same time, however, it should be clear that meta¬ 
analyses need not close off further research in an area. Indeed, they may be 
employed to help us formulate more clearly just what that research should be.

A similar set of meta-analytic procedures was recently carried out in the
research area of sex differences in cognitive functioning (Rosenthal & Rubin,
1982b). In that analysis, we showed that, in the four areas of cognitive
functioning investigated by Hyde (1981), effect sizes were significantly
heterogeneous. In addition, we showed that in all four areas, studies conducted
more recently showed a substantial gain in cognitive performance by females
relative to males (unweighted mean r = .40).

VII. A ONE-SAMPLE EFFECT SIZE INDEX: π

All of the effect size indices we have employed so far in this chapter were
two-sample or multi-sample indices, e.g., d, g, and r. Such effect size indices
are not directly applicable, however, to areas of research employing one-sample
studies in which performance is compared to some theoretical value;
often the level expected if performance is no better than that expected if the
null hypothesis were true. In so-called ganzfeld studies, for example, subjects
are asked to guess which of four or five or six stimuli had been “transmitted”
by an agent or sender (Harris & Rosenthal, 1988; Honorton, 1985; Hyman,
1985; Rosenthal, 1986). A measure of effect size, π, has been developed for
this type of one-sample situation (Rosenthal & Rubin, 1989). This index is
expressed as the proportion of correct guesses if there had been only two
choices to choose from. When there are more than two choices, π converts
the proportion of hits to the proportion of hits made if there had been only
two equally likely choices:

\[ \pi = \frac{P(k - 1)}{P(k - 2) + 1} \qquad [6.3] \]

where P = the raw proportion of hits and k is simply the number of alternative
choices available.

The standard error of π is given by:

\[ SE(\pi) = \frac{1}{\sqrt{N}} \left( \frac{\pi(1 - \pi)}{\sqrt{P(1 - P)}} \right) \qquad [6.4] \]

so that we can test the significance of a given π by means of the following Z
test:

\[ Z = \frac{\pi - .50}{SE(\pi)} \qquad [6.5] \]

Confidence intervals are also readily available, e.g., for 95% confidence
intervals we employ:

\[ \pi \pm 1.96\, SE(\pi) \qquad [6.6] \]
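
A minimal sketch of these one-sample computations follows. It assumes that N is the number of trials on which the raw hit rate is based; the study shown (19 hits in 50 trials with four choices) is hypothetical, and the function names are invented for the example.

```python
import math


def pi_index(p_hits, k):
    """Convert a raw hit rate among k choices to the two-choice scale."""
    return p_hits * (k - 1) / (p_hits * (k - 2) + 1)


def pi_standard_error(p_hits, k, n_trials):
    """Standard error of the index for a hit rate based on n_trials."""
    pi = pi_index(p_hits, k)
    return pi * (1 - pi) / (math.sqrt(n_trials) * math.sqrt(p_hits * (1 - p_hits)))


def pi_summary(p_hits, k, n_trials, z_crit=1.96):
    """Index, standard error, Z against .50, and confidence interval."""
    pi = pi_index(p_hits, k)
    se = pi_standard_error(p_hits, k, n_trials)
    return {"pi": pi, "se": se, "z": (pi - 0.50) / se,
            "ci": (pi - z_crit * se, pi + z_crit * se)}


# Hypothetical study: 19 hits in 50 trials with four choices (chance = .25).
result = pi_summary(19 / 50, k=4, n_trials=50)
print({name: value for name, value in result.items()})
```

A useful sanity check is that a hit rate exactly at chance (P = 1/k) converts to .50 for any k, and that with k = 2 the index is simply P.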
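
Contrasts among independent π’s can be tested by:

\[ Z = \frac{\sum \lambda_j \pi_j}{\sqrt{\sum \lambda_j^2\, SE^2(\pi_j)}} \qquad [6.7] \]

Finally, we can test the heterogeneity of a set of π’s from the following
relationship:

\[ \chi^2(K - 1) = \sum \left( \frac{\pi_j - \bar{\pi}}{SE(\pi_j)} \right)^2 \qquad [6.8] \]

where

\[ \bar{\pi} = \frac{\sum w_j \pi_j}{\sum w_j} \qquad [6.9] \]

and

\[ w_j = \frac{1}{(SE(\pi_j))^2} \qquad [6.10] \]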
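
The contrast and heterogeneity tests for a set of independent π's can be sketched as follows, assuming the contrast takes the same form as the contrasts among z_r's (weights summing to zero, each estimate divided through by its own standard error) and that the heterogeneity test weights each π by the reciprocal of its squared standard error. The four π's and standard errors are hypothetical, and the function names are illustrative.

```python
import math


def contrast_z(pis, ses, lambdas):
    """Z for a contrast among independent pi's with weights summing to zero."""
    numerator = sum(l * p for l, p in zip(lambdas, pis))
    return numerator / math.sqrt(sum((l * s) ** 2 for l, s in zip(lambdas, ses)))


def heterogeneity(pis, ses):
    """Return (weighted mean pi, chi-square with K - 1 df)."""
    weights = [1 / s ** 2 for s in ses]
    mean_pi = sum(w * p for w, p in zip(weights, pis)) / sum(weights)
    chi2 = sum(((p - mean_pi) / s) ** 2 for p, s in zip(pis, ses))
    return mean_pi, chi2


# Hypothetical pi's and standard errors for four independent studies.
pis = [0.70, 0.64, 0.52, 0.48]
ses = [0.06, 0.08, 0.05, 0.07]
print(contrast_z(pis, ses, [1, 1, -1, -1]))
print(heterogeneity(pis, ses))
```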


TABLE 6.10

Stem-and-Leaf Plot of Effect Sizes (π) for 28
“Direct Hit” Ganzfeld Studies (a)

TABLE 6.11

Statistical Summary of 28 “Direct Hit” Ganzfeld Studies

Central tendency (π)
  Unweighted mean            .62
  Weighted (a) mean          .62
  Median
  Proportion > .50           .82

Significance tests
  Combined Stouffer Z        6.60
  t test of mean π - .50     3.39
  Z of proportion > .50      3.40

Confidence interval          From    To
  95%                        .55     .70
  99%                        .52     .72
  99.5%                      .51     .73

Variability
  Maximum
  Quartile 3 (Q3)
  Median (Q2)
  Quartile 1 (Q1)
  Minimum
  Q3 - Q1
  σ: [.75(Q3 - Q1)]
  S
  Robustness

a. By number of trials per study; total number of trials = 835.
b. Based on N of 28 studies.



For examples of the application of all these equations see Rosenthal and 
Rubin (1989). 

Table 6.10 shows the stem-and-leaf plot of effect sizes (π) for 28 ganzfeld
studies, and Table 6.11 shows the statistical summary of these data.

As a slightly more complex index of the stability, replicability, or clarity 
of the average effect size found in the set of replicates, one could employ the 
mean effect size divided either by its standard error (S/√N where N is the
total number of replicates), or simply by S. The latter index of mean effect 
size divided by its standard deviation (S) is the reciprocal of the coefficient 
of variation or a kind of coefficient of robustness (Rosenthal, 1990). 


VII.A. The Coefficient of Robustness of Replication

Although the standard error of the mean effect size along with confidence 
intervals placed around the mean effect size are of great value (Rosenthal & 
Rubin, 1978), it will sometimes be useful to employ a robustness coefficient 
that does not increase simply as a function of the increasing number of 
replications. Thus, if we want to compare two research areas for their 
robustness, adjusting for the difference in number of replications in each 
research area, we may prefer the robustness coefficient defined as the 
reciprocal of the coefficient of variation. 


a. Probability of a direct hit = .50 for the effect size index π.










The utility of this coefficient is based on two ideas: first, that replication
success, clarity, or robustness depends on the homogeneity of the obtained
effect size, and second, that it depends also on the unambiguity or clarity of
the directionality of the result. Thus, a set of replications grows in robustness
as the variance of the effect size decreases and as the distance of the mean
effect size from zero increases. Incidentally, the mean may be weighted,
unweighted, or trimmed (Tukey, 1977). Indeed, it need not be the mean at
all but any measure of location or central tendency (e.g., the median). 
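
The difference between the two indices can be illustrated with a short sketch. It assumes an unweighted mean and the ordinary sample standard deviation S; the effect sizes are hypothetical, with the second research area constructed simply by repeating the first area's values so that it has more replications but a similar spread.

```python
import math
import statistics


def robustness(effect_sizes):
    """Mean effect size divided by its standard deviation S."""
    return statistics.mean(effect_sizes) / statistics.stdev(effect_sizes)


def mean_over_se(effect_sizes):
    """Mean divided by its standard error S / sqrt(N); grows with N."""
    return robustness(effect_sizes) * math.sqrt(len(effect_sizes))


# Hypothetical effect sizes: area B repeats area A's values twice, so it has
# more replications but nearly the same mean and spread.
area_a = [0.30, 0.50, 0.40, 0.60, 0.20]
area_b = area_a * 2
for name, area in (("A", area_a), ("B", area_b)):
    print(name, round(robustness(area), 2), round(mean_over_se(area), 2))
```

In this constructed case the mean divided by its standard error rises substantially with the added replications, while the robustness coefficient changes only slightly.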


The Evaluation of Meta-Analytic 
Procedures and Meta-Analytic Results 


Criticisms of the meta-analytic enterprise are described and discussed under the general 
headings of sampling bias, information loss, problems of heterogeneity of method and of 
quality, problems of independence, exaggeration of significance levels, and the practical
importance of any particular estimated effect size. 

We have had an opportunity to examine a variety of meta-analytic procedures
so that we would now be able to carry out meta-analyses of research
areas. But should we want to? The purposes of this final chapter are to 
examine some negative evaluations of meta-analysis and to evaluate the 
merits of these evaluations. 

In the years 1980, 1981, and 1982 alone, well over 300 papers were 
published on the topic of meta-analysis (Lamb & Whitla, 1983), and the rapid 
growth continues (Hunter & Schmidt, 1990). Does this represent a giant 
stride forward in the development of the behavioral and social sciences or 
does it signal a lemming-like flight to disaster? Judging by the reactions to 
past meta-analytic enterprises, there are some who take the more pessimistic 
view. Some three dozen scholars were invited to respond to the meta-analysis 
of studies of interpersonal expectancy effects (Rosenthal & Rubin, 1978). 
Although much of the commentary dealt with the substantive topic of 
interpersonal expectancy effects, a good deal of it dealt with methodological 
aspects of meta-analytic procedures and products. Some of the criticisms 
offered were accurately anticipated by Glass (1978) who had earlier received 
commentary on his meta-analytic work (Glass, 1976) and that of his colleagues
(Smith & Glass, 1977; Glass et al., 1981). In this chapter, the
criticisms of our commentators are grouped into a half-dozen conceptual
categories, described, and discussed.

I. SAMPLING BIAS AND THE
FILE DRAWER PROBLEM

This criticism holds that there is a retrievability bias such that studies 
retrieved do not reflect the population of studies conducted. One version of 
this criticism is that the probability of publication is increased by the statis¬ 
tical significance of the results so that published studies may not be repre¬ 
sentative of the studies conducted. This criticism is well taken although it 
applies equally to traditional narrative reviews of the literature. One set of 
procedures that can be employed to address this problem was described in 
Chapter 5 when the file drawer problem was discussed. 

A bizarre version of this criticism simply holds that the unretrieved studies 
are essentially a mirror image of the retrieved studies (Rosenthal & Rubin, 
1978). Thus if the combined Z for 100 studies is +6.50, there is postulated to 
be, in the file drawers, another set of studies with combined Z = -6.50! No 
mechanism whereby this phenomenon may operate has been proposed and no 
reply to this criticism seems possible. One can too easily postulate a universe 
in which for every observed outcome there is an unobserved outcome equal 
and opposite in magnitude and/or in significance level. 

II. LOSS OF INFORMATION 
II. A. Overemphasis on Single Values 

The first of two criticisms relevant to information loss notes the danger of 
trying to summarize a research domain by a single value such as a mean 
effect size. This criticism holds that defining a relation in nature by a single 
value leads to overlooking moderator variables. The force of this criticism 
is removed when meta-analysis is seen as including not only combining 
effect sizes (and significance levels) but also comparing effect sizes in both 
diffuse and, especially, in focused fashion. 

II.A.1. Overlooking negative instances. A special case of the criticism
under discussion is that, by emphasizing average values, negative cases are 
overlooked. There are several ways in which negative cases can be defined; 
e.g., p > .05, r = 0, r negative, r significantly negative, and so on. However 
we may define negative cases, when we divide the sample of studies into 
negative and positive cases we have merely dichotomized an underlying 
continuum of effect sizes or significance levels and accounting for negative 
cases is simply a special case of finding moderator variables. 

II.B. Glossing Over Details 

Although it is accurate to say that meta-analyses gloss over details, it is 
equally accurate to say that traditional narrative reviews do so and that data 
analysts do so in every study in which any statistics are computed. Summarizing
requires us to gloss over details. If we describe a nearly normal
distribution of scores by the mean and σ we have nearly described the
distribution perfectly. If the distribution is quadrimodal, the mean and σ will
not do a good job of summarizing the data. It is the data analyst’s job in the
individual study, and the meta-analyst’s job in meta-analysis, to “gloss well.”
Providing the reader with all the raw data of all the studies summarized avoids
this criticism but serves no useful review function. Providing the reader with 
a stem-and-leaf display of the effect sizes obtained, along with the results of 
the diffuse and focused comparisons of effect sizes, does some glossing but 
it does a lot of informing besides. 

There is, of course, nothing to prevent the meta-analyst from reading 
each study as carefully and assessing it as creatively as might be done by a 
more traditional reviewer of a literature. Indeed, we have something of an 
operational check on reading articles carefully in the case of meta-analysis. 
If we do not read the results carefully, we cannot obtain effect sizes and 
significance levels. In traditional reviews, results may have been read care¬ 
fully or not read at all with the abstract or the discussion section providing 
“the results” to the more traditional reviewer. 

III. PROBLEMS OF HETEROGENEITY 
III.A. Heterogeneity of Method

The first of two criticisms relevant to problems of heterogeneity notes 
that meta-analyses average over studies in which the independent variables, 
the dependent variables, and the sampling units are not uniform. How can 
we speak of interpersonal expectancy effects, meta-analytically, when some 
of the independent variables are operationalized by telling experimenters 
that tasks are easy versus hard or by telling experimenters that subjects are 
good versus poor task performers? How can we speak meta-analytically of 
these expectancy effects when sometimes the dependent variables are reac¬ 
tion times, sometimes IQ test scores, and sometimes responses to inkblots? 
How can we speak of these effects when sometimes the sampling units are 
rats, sometimes college sophomores, sometimes patients, sometimes pu¬ 
pils? Are these not all vastly different phenomena? How can they be pooled 
together in a single meta-analysis? 

Glass (1978) has eloquently addressed this issue—the apples and or¬ 
anges issue. They are good things to mix, he wrote, when we are trying to 
generalize to fruit. Indeed, if we are willing to generalize over subjects 
within studies, why should we not be willing to generalize over studies? If 
subjects behave very differently within studies we block on subject charac¬ 
teristics to help us understand why. If studies yield very different results 
from each other, we block on study characteristics to help us understand 
why. It is very useful to be able to make general statements about fruit. If, in 
addition, it is also useful to make general statements about apples, about 
oranges, and about the differences between them, there is nothing in meta-analytic
procedures to prevent us from doing so. Indeed, Chapter 4 especially
deals with these procedures in detail.

III.B. Heterogeneity of Quality 

One of the most frequent criticisms of meta-analyses is that bad studies 
are thrown in with good. This criticism must be broken down into two 
questions: (1) What is a bad study? and (2) What shall we do about bad 
studies? 


III.B.1. Defining “bad” studies. Too often, deciding what is a bad study
is a procedure unusually susceptible to bias or to claims of bias (Fiske, 
1978). Bad studies are too often those whose results we do not like or, as 
Glass et al. (1981) have put it, the studies of our “enemies.” Therefore when 
reviewers of research tell us they have omitted the bad studies, we should 
satisfy ourselves that this has been done by criteria we find acceptable. A 
discussion of these criteria (and the computation of their reliability) can be 
found in Chapter 3. 

III.B.2. Dealing with bad studies. The distribution of studies on a dimen¬ 
sion of quality is of course not really dichotomous (good versus bad) but 
continuous with all possible degrees of quality. Because we dealt with the 
issue in detail in Chapter 3, we can be brief here: The fundamental method 
of coping with bad studies or, more accurately, variations in the quality of 
research, is by differential weighting of studies. Dropping studies is merely 
the special case of zero weighting. 

The most important question to ask about study quality is asked by Glass 
(1976): Is there a relationship between quality of research and effect size 
obtained? If there is not, the inclusion of poorer quality studies will have no 
effect on the estimate of the average effect size though it will help to de¬ 
crease the size of the confidence interval around that mean. If there is a 
relationship between the quality of research and effect size obtained, we can 
employ whatever weighting system we find reasonable (and that we can 
persuade our colleagues and critics also to find reasonable). 
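
The point that dropping a study is the special case of zero weighting can be shown with a brief sketch. The effect sizes and the 0-4 quality ratings below are hypothetical, and averaging r's directly (rather than z_r's) is a simplification made only to keep the example short.

```python
def weighted_mean(effect_sizes, weights):
    """Weighted mean effect size; a weight of zero removes a study entirely."""
    total = sum(weights)
    if total <= 0:
        raise ValueError("at least one study needs a positive weight")
    return sum(w * e for w, e in zip(weights, effect_sizes)) / total


# Hypothetical effect sizes (r) and quality ratings on a 0-4 scale.
rs = [0.40, 0.25, 0.10, 0.35]
quality = [4, 3, 0, 2]

by_quality = weighted_mean(rs, quality)
dropped = weighted_mean([r for r, q in zip(rs, quality) if q > 0],
                        [q for q in quality if q > 0])
unweighted = weighted_mean(rs, [1] * len(rs))
print(round(by_quality, 3), round(dropped, 3), round(unweighted, 3))
```

Weighting by quality and explicitly dropping the zero-rated study give the same mean, whereas the unweighted mean lets the lowest-quality study count as much as the best one.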


IV. PROBLEMS OF INDEPENDENCE 
IV.A. Responses within Studies 

The first of two criticisms relevant to problems of independence notes
that several effect size estimates and several tests of significance may be
generated by the same subjects within each study. This can be a very apt
criticism under some conditions. Chapter 2 deals with the problem in detail.


IV.B. Studies within Sets of Studies

Even when all studies yield only a single effect size estimate and level of 
significance and even when all studies employ sampling units that do not 
also appear in other studies, there is a sense in which results may be non- 
independent. That is, studies conducted in the same laboratory, or by the 
same research group, may be more similar to each other (in the sense of an 
intraclass correlation) than they are to studies conducted in other laboratories
or by other research groups (Jung, 1978; Rosenthal, 1966, 1969, 1979,
1990b; Rosenthal & Rosnow, 1984a). The conceptual and statistical implications
of this problem are not yet worked out. However, there are some
preliminary data bearing on this issue that are at least moderately reassuring. 

Table 7.1 shows a series of 94 studies blocked or subdivided into seven 
areas of research on interpersonal expectancy effects (Rosenthal, 1969). For 
each area, the combined Z was computed; once based on the n of studies in 
that area, and once based on the n of laboratories or principal investigators. 
For most of the research areas there is little difference in n between studies 
and laboratories so there is little difference in their Z’s. The only noticeable 
difference in Z’s is for the research area in which there were substantially 
more studies (n = 57) than there were laboratories (n = 20). Even there, 
however, it seems unlikely that we would have drawn very different conclusions
from these two methods of analysis.

Perhaps the most important result, however, is seen when we compare 
the overall Z for all 94 studies with the overall Z for the 48 laboratories. 
There is less than a 3% decrease in the combined Z when we go from the 
analysis per study to the analysis per laboratory. It would be useful if similar 
analyses employing effect size estimates were available. 
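
The comparison of analyses per study and per laboratory can be sketched in a few lines. This is an illustration under stated assumptions rather than the computation behind Table 7.1: it assumes the Z's are combined by the Stouffer method (sum of Z's divided by the square root of their number), it averages the Z's within each laboratory before combining, and the laboratory labels and Z values are hypothetical. Only the last calculation uses figures from the text.

```python
import math
from collections import defaultdict


def stouffer_z(zs):
    """Combine standard normal deviates: sum of Z's over the square root of their count."""
    zs = list(zs)
    return sum(zs) / math.sqrt(len(zs))


def combined_z_by_lab(studies):
    """studies: iterable of (lab_id, z). Average Z within each lab, then combine."""
    by_lab = defaultdict(list)
    for lab, z in studies:
        by_lab[lab].append(z)
    lab_means = [sum(v) / len(v) for v in by_lab.values()]
    return stouffer_z(lab_means)


# Hypothetical studies: three from one laboratory, one each from two others.
studies = [("lab_a", 2.1), ("lab_a", 1.7), ("lab_a", 2.4),
           ("lab_b", 0.9), ("lab_c", 1.5)]

z_per_study = stouffer_z(z for _, z in studies)
z_per_lab = combined_z_by_lab(studies)
print(f"combined Z per study:      {z_per_study:.2f}")
print(f"combined Z per laboratory: {z_per_lab:.2f}")

# Overall figures reported in the text: Z = 9.82 (studies), Z = 9.55 (laboratories).
pct_decrease = 100 * (9.82 - 9.55) / 9.82
print(f"decrease in overall Z: {pct_decrease:.2f}%")
```

The final line reproduces the "less than 3%" decrease in the overall Z.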


TABLE 7.1

Significance Levels Computed Separately
for Studies and for Laboratories

                                 Studies        Laboratories    Difference
Research Area                    Z      n        Z      n       in Z’s

Animal learning                8.64     9      8.46     5         .18
Learning and ability           3.01     9      2.96     8         .05
Psychophysical judgments       2.55     9      2.45     6         .10
Reaction time                  1.93     3      1.93     3         .00
Inkblot tests                  3.55     4      3.25     3         .30
Laboratory interviews          5.30     6      5.30     6         .00
Person perception              4.07    57      2.77    20        1.30

All studies                    9.82    94 a    9.55    48 a       .27

a. Three entries were nonindependent and the mean Z across areas was used for the single
independent entry.


V. EXAGGERATION OF SIGNIFICANCE LEVELS

V.A. Truncating Significance Levels 

It has been suggested that all p levels less than .01 (Z values greater than
2.33) be reported as .01 (Z = 2.33) because p’s less than .01 are likely to be
in error (Elashoff, 1978). This truncating of Z’s cannot be recommended and
will, in the long run, lead to serious errors of inference (Rosenthal & Rubin,
1978). If there is reason to suspect that a given p level < .01 is in error it
should, of course, be corrected before employing it in the meta-analysis. It
should not, however, be changed to p = .01 simply because it is less than .01.
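
The cost of truncation can be seen in a small numerical illustration. This is a sketch under assumptions: the four Z values are hypothetical, and the combined Z is computed by the Stouffer method (sum of Z's over the square root of their number), which is assumed here for illustration only.

```python
import math

Z_CAP = 2.33  # Z corresponding to a one-tailed p of .01


def stouffer_z(zs):
    """Combine standard normal deviates: sum of Z's over the square root of their count."""
    zs = list(zs)
    return sum(zs) / math.sqrt(len(zs))


def truncate(zs, cap=Z_CAP):
    """Replace every Z above the cap by the cap, as the truncation proposal would."""
    return [min(z, cap) for z in zs]


# Hypothetical Z's from four studies, two of them beyond the .01 level.
zs = [0.40, 1.10, 3.50, 4.20]

full = stouffer_z(zs)
capped = stouffer_z(truncate(zs))
print(f"combined Z, as reported: {full:.2f}")
print(f"combined Z, truncated:   {capped:.2f}")
```

In this hypothetical set, truncation discards exactly the information carried by the strongest results, so the combined Z understates the evidence.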

V.B. Too Many Studies 

It has been noted as a criticism of meta-analyses that as the number of 
studies increases, there is a greater and greater probability of rejecting the 
null hypothesis (Mayo, 1978). When the null hypothesis is false and, therefore,
ought to be rejected, it is indeed true that adding observations (either
sampling units within studies or new studies) increases statistical power.
However, it is hard to accept as a legitimate criticism of a procedure a
characteristic that increases its accuracy and decreases its error rate (in
this case, type II errors). When the null hypothesis is really true, of course,
adding studies does not lead to increased probability of rejecting the null 
hypothesis. Adding studies, it should also be noted, does not increase the 
size of the estimated effect. 
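
The argument can be made concrete with a short calculation. The sketch below assumes, for illustration only, that Z's are combined by the Stouffer method and that every study has the same expected Z; the function name and the value of .50 are hypothetical.

```python
import math


def expected_combined_z(mean_z_per_study, k):
    """Expected Stouffer combined Z when each of k studies has the given expected Z."""
    return mean_z_per_study * math.sqrt(k)


for k in (4, 16, 64):
    false_null = expected_combined_z(0.50, k)  # a real effect, expected Z of .50 per study
    true_null = expected_combined_z(0.00, k)   # no effect, expected Z of zero per study
    print(f"k = {k:2d}: null false -> {false_null:.1f}, null true -> {true_null:.1f}")
```

Under these assumptions the expected combined Z grows with the square root of the number of studies only when there is an effect to detect; when the null hypothesis is true it stays at zero however many studies are added.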

A related feature of meta-analysis is that it may, in general, lead to a
decrease in type II errors even when the number of studies is modest. The 
empirical support for this was described in Chapter 1 when the research by 
Cooper and Rosenthal (1980) was summarized. Procedures requiring the 
research reviewer to be more systematic and to use more of the information 
in the data seem to be associated with increases in power, i.e., decreases in 
type II errors. 


VI. THE PRACTICAL IMPORTANCE OF 
THE ESTIMATED EFFECT SIZE 

Mayo (1978) criticized Cohen (1977) for calling an effect size large (d = 
.80) when it accounted for “only” 14% of the variance. Similarly, Rimland 
(1979) felt that the Smith and Glass (1977) meta-analysis of psychotherapy 
outcome studies sounded the death knell for psychotherapy because the 
effect size was equivalent to an r of .32 accounting for “only” 10% of the
variance.
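
The two proportion-of-variance figures can be checked directly. A minimal sketch, assuming the standard conversion from d to r for two groups of equal size, r = d / sqrt(d² + 4); the function name is an assumption.

```python
import math


def r_from_d(d):
    """Convert Cohen's d to r, assuming two groups of equal size."""
    return d / math.sqrt(d * d + 4.0)


r_large = r_from_d(0.80)
print(f"d = .80 -> r = {r_large:.3f}, r squared = {r_large ** 2:.3f}")
print(f"r = .32 -> r squared = {0.32 ** 2:.3f}")
```

Rounded to two places, these are the 14% and 10% quoted above.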


VI.A. The Binomial Effect Size Display (BESD)

Despite the growing awareness of the importance of estimating effect 
sizes, there is a problem in evaluating various effect size estimators from 
the point of view of practical usefulness (Cooper, 1981). Rosenthal and Rubin
(1979b, 1982c) found that neither experienced behavioral researchers
nor experienced statisticians had a good intuitive feel for the practical meaning
of such common effect size estimators as r², omega², epsilon², and similar
estimates.

Accordingly, Rosenthal and Rubin introduced an intuitively appealing 
general purpose effect size display whose interpretation is perfectly transparent:
the binomial effect size display (BESD). There is no sense in which
they claim to have resolved the differences and controversies surrounding 
the use of various effect size estimators but their display is useful because it 
is easily understood by researchers, students, and lay persons, applicable in 
a wide variety of contexts, and conveniently computed. 

The question addressed by BESD is: what is the effect on the success rate 
(e.g., survival rate, cure rate, improvement rate, selection rate, and so on) of
the institution of a new treatment procedure, a new selection device, or a 
new predictor variable? It therefore displays the change in success rate (e.g., 
survival rate, cure rate, improvement rate, accuracy rate, selection rate, etc.) 
attributable to the new treatment procedure, new selection device, or new 
predictor variable. An example shows the appeal of the display. Suppose the 
estimated mean effect size were found to be an r of .32, approximately the 
size of the effects reported by Smith and Glass (1977) and by Rosenthal and 
Rubin (1978) for the effects of psychotherapy and of interpersonal expectancy
effects, respectively.

Table 7.2 is the BESD corresponding to an r of .32 or an r² of .10. The
table shows clearly that it is absurd to label as modest an effect size equivalent
to increasing the success rate from 34% to 66% (e.g., reducing a death
rate from 66% to 34%). Even so small an r as .20, accounting for “only” 4% 
of the variance is associated with an increase in success rate from 40% to 
60%, e.g., a decrease in death rate from 60% to 40%, hardly a trivial effect. 
It might be thought (e.g., Hunter & Schmidt, 1990, p. 202) that the BESD can 
be employed only for dichotomous outcomes (e.g., alive vs. dead) and not 
for continuous outcomes (e.g., scores on a Likert-type scale of improvement 
due to psychotherapy, or gains in performance due to favorable interpersonal 
expectations). Fortunately, however, the BESD works well for both types of 
outcomes under a wide variety of conditions (Rosenthal & Rubin, 1982c). 

A great convenience of the BESD is how easily we can convert it to r (or
r²) and how easily we can go from r (or r²) to the display.
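
The conversion in both directions can be sketched in a few lines. In the BESD the treatment group success rate is .50 + r/2 and the control group success rate is .50 - r/2, so the difference between the two rates is r itself. The function names below are assumptions made for this illustration.

```python
def besd(r):
    """Return (treatment_success, control_success) as proportions for a correlation r."""
    if not -1.0 <= r <= 1.0:
        raise ValueError("r must lie between -1 and 1")
    return 0.5 + r / 2, 0.5 - r / 2


def r_from_besd(treatment_success, control_success):
    """Recover r as the difference between the two success rates of a BESD."""
    return treatment_success - control_success


for r in (0.20, 0.30, 0.32):
    treatment, control = besd(r)
    print(f"r = {r:.2f} (r squared = {r * r:.2f}): "
          f"success rate {control:.0%} -> {treatment:.0%}")
```

For r = .32 this gives the 34% and 66% success rates discussed above.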

Table 7.3 shows systematically the increase in success rates associated
with various values of r² and r. For example, an r of .30, accounting for
“only” 9% of the variance, is associated with a reduction in death rate from
65% to 35%, or more generally with an increase in success rate from 35% to
65%. The last column of Table 7.3 shows that the difference in success rates
is identical to r. Consequently, the experimental group success rate in the
BESD is computed as .50 + r/2 whereas the control group success rate is
computed as .50 - r/2.


TABLE 7.2

The Binomial Effect Size Display (BESD) for an r of .32
that Accounts for “Only” 10% of the Variance

                    Treatment Result
Condition           Alive     Dead        Σ

Treatment              66       34      100
Control                34       66      100

Σ                     100      100      200


TABLE 7.3

Changes in Success Rates (BESD)
Corresponding to Various Values of r² and r

                     Equivalent to a
                  Success Rate Increase      Difference in
  r²       r        From         To         Success Rates a

  .00     .02        .49         .51             .02
  .00     .04        .48         .52             .04
  .00     .06        .47         .53             .06
  .01     .08        .46         .54             .08
  .01     .10        .45         .55             .10
  .01     .12        .44         .56             .12
  .03     .16        .42         .58             .16
  .04     .20        .40         .60             .20
  .06     .24        .38         .62             .24
  .09     .30        .35         .65             .30
  .16     .40        .30         .70             .40
  .25     .50        .25         .75             .50
  .36     .60        .20         .80             .60
  .49     .70        .15         .85             .70
  .64     .80        .10         .90             .80
  .81     .90        .05         .95             .90
 1.00    1.00        .00        1.00            1.00

a. The difference in success rates in a BESD is identical to r.


VI.B. The Propranolol Study and the BESD

On October 29, 1981, the National Heart, Lung, and Blood Institute
officially discontinued its placebo-controlled study of propranolol because
the results were so favorable to the treatment that it would be unethical to
keep the placebo control patients from receiving the treatment (Kolata,
1981). The two-year data for this study were based on 2108 patients and
χ²(1) was approximately 4.2. What, then, was the size of the effect that led
the Institute to break off its study? Was the use of propranolol accounting
for 90% of the variance in death rates? Was it 50% or 10%, the overly
modest effect size that should prompt us to give up psychotherapy? From
Equation 2.15, we find the proportion-of-variance-accounted-for (r²):

     r² = χ²(1)/N = 4.2/2108 = .002

Thus the propranolol study was discontinued for an effect accounting for
1/5th of 1% of the variance! To display this result as a BESD we take the
square root of r² to obtain the r we use for the BESD. That r is about .04
which displays as shown in Table 7.4. As behavioral researchers we are not
accustomed to thinking of r’s of .04 as reflecting effect sizes of practical
importance. If we were among the 4 per 100 who moved from one outcome
to the other, we might well revise our view of the practical import of small
effects!


TABLE 7.4

The Binomial Effect Size Display
for the Discontinued Propranolol Study

                    Treatment Result
Condition           Alive     Dead        Σ

Propranolol            52       48      100
Placebo                48       52      100

Σ                     100      100      200


This type of result seen in the propranolol study is not at all unusual in
biomedical research (Rosenthal, 1990a). Some years later, on December 18,
1987, it was decided to end prematurely a randomized double blind experiment
on the effect of aspirin on reducing heart attacks (Rosnow & Rosenthal,
1989; Steering Committee of the Physicians’ Health Study Research Group,
1988). The reason for this termination of this large study (N = 22,071) was
that aspirin was so effective in preventing heart attacks (and deaths) that it
would be unethical to continue to give half the physician subjects a placebo.
The r² for this important effect was about half the size of the r² for the
propranolol effect, .0011 versus .0020, and r for the aspirin effect was .034.
Table 7.5 summarizes these results and presents several others that also yield
“small” r’s despite being results of major medical, behavioral, and/or economic
importance.
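
The propranolol and aspirin figures can be checked with a few lines of arithmetic. A minimal sketch, assuming the standard relation r² = χ²(1)/N for a 2 × 2 table; the function name is an assumption, and the inputs are the values reported in the text (χ²(1) of about 4.2 with N = 2108 for propranolol, r² of about .0011 for aspirin).

```python
import math


def r_squared_from_chi2(chi2_1df, n):
    """Proportion of variance accounted for, from a 1-df chi-square and total N."""
    return chi2_1df / n


# Propranolol study: chi-square(1) of about 4.2 based on 2108 patients.
r2_propranolol = r_squared_from_chi2(4.2, 2108)
r_propranolol = math.sqrt(r2_propranolol)

# BESD success rates for that r: .50 + r/2 versus .50 - r/2.
treated = 0.5 + r_propranolol / 2
control = 0.5 - r_propranolol / 2

# Aspirin study: r squared of about .0011.
r_aspirin = math.sqrt(0.0011)

print(f"propranolol: r squared = {r2_propranolol:.4f}, r = {r_propranolol:.3f}")
print(f"BESD: {control:.0%} -> {treated:.0%}")
print(f"aspirin: r = {r_aspirin:.3f}")
```

The rounded results agree with the values in the text: an r² of about .002 and an r of about .04 for propranolol, and an r of about .03 for aspirin.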

TABLE 7.5

Effect Sizes of Seven Independent Variables

Independent Variable          Dependent Variable

Aspirin a                     Heart attacks
Propranolol a                 Death
Vietnam veteran status b      Alcohol problems
Testosterone c                Adult delinquency
Cyclosporine d                Death
AZT e                         Death
Psychotherapy a               Improvement

a. See text for references.
b. Centers for Disease Control Vietnam Experience Study, 1988.
c. Dabbs & Morris, 1990.
d. Canadian Multicentre Transplant Study Group, 1983.
e. Barnes, 1986.


VI.C. Concluding Note on Interpreting Effect Sizes 

Rosenthal and Rubin (1982c) proposed that the reporting of effect sizes 
could be made more intuitive and more informative by using the BESD. It
was their belief that the use of the BESD to display the increase in success
rate due to treatment would more clearly convey the real-world importance 
of treatment effects than would the commonly used descriptions of effect 
size, especially those based on the proportion of variance accounted for. 

One effect of the routine employment of a display procedure such as the 
BESD to index the practical meaning of our research results would be to give 
us more useful and realistic assessments of how well we are doing as 
researchers in applied social and behavioral science and in the social and 
behavioral sciences more generally. Employment of the BESD has, in fact, 
shown that we are doing considerably better in our “softer” sciences than we
thought we were.